Evolution as Backstop for Reinforcement Learning

Markets/evolution as backstops/ground truths for reinforcement learning/optimization: on some connections between Coase’s theory of the firm/linear optimization/DRL/evolution/multicellular life/pain as multi-level optimization problems.
Bayes, biology, psychology, decision-theory, sociology, NN, philosophy, insight-porn
2018-12-06–2019-12-01 finished certainty: possible importance: 7

One defense of free markets notes the inability of non-market mechanisms to solve planning & optimization problems. This has difficulty with Coase’s paradox of the firm, and I note that the difficulty is increased by the fact that with improvements in computers, algorithms, and data, ever larger planning problems are solved. Expanding on some Cosma Shalizi comments, I suggest interpreting the phenomenon as a multi-level nested optimization paradigm: many systems can be usefully described as having two (or more) levels where a slow sample-inefficient but ground-truth ‘outer’ loss such as death, bankruptcy, or reproductive fitness, trains & constrains a fast sample-efficient but possibly misguided ‘inner’ loss which is used by learned mechanisms such as neural networks or linear programming (a group-selection perspective). So, one reason for free-market or evolutionary or Bayesian methods in general is that while poorer at planning/optimization in the short run, they have the advantage of simplicity and operating on ground-truth values, and serve as a constraint on the more sophisticated non-market mechanisms. I illustrate by discussing corporations, multicellular life, reinforcement learning & meta-learning in AI, and pain in humans. This view suggests that there are inherent balances between market/non-market mechanisms which reflect the relative advantages between a slow unbiased method and faster but potentially arbitrarily biased methods.

In Coase’s theory of the firm, a paradox is noted: idealized competitive markets are optimal for allocating resources and making decisions to reach efficient outcomes, but each market is made up of participants such as large multinational mega-corporations which are not internally made of markets and make their decisions by non-market mechanisms, even for things which could clearly be outsourced. In an oft-quoted and amusing passage, Herbert Simon dramatizes the actual situation:

Suppose that [“a mythical visitor from Mars”] approaches the Earth from space, equipped with a telescope that reveals social structures. The firms reveal themselves, say, as solid green areas with faint interior contours marking out divisions and departments. Market transactions show as red lines connecting firms, forming a network in the spaces between them. Within firms (and perhaps even between them) the approaching visitor also sees pale blue lines, the lines of authority connecting bosses with various levels of workers. As our visitor looked more carefully at the scene beneath, it might see one of the green masses divide, as a firm divested itself of one of its divisions. Or it might see one green object gobble up another. At this distance, the departing golden parachutes would probably not be visible. No matter whether our visitor approached the United States or the Soviet Union, urban China or the European Community, the greater part of the space below it would be within green areas, for almost all of the inhabitants would be employees, hence inside the firm boundaries. Organizations would be the dominant feature of the landscape. A message sent back home, describing the scene, would speak of “large green areas interconnected by red lines.” It would not likely speak of “a network of red lines connecting green spots.”…When our visitor came to know that the green masses were organizations and the red lines connecting them were market transactions, it might be surprised to hear the structure called a market economy. “Wouldn’t ‘organizational economy’ be the more appropriate term?” it might ask.

A free competitive market is a weighing machine, not a thinking machine; it weighs & compares proposed buys & sells made by participants, and reaches a clearing price. But where, then, do the things being weighed come from? Market participants are themselves not markets, and to appeal to the wisdom of the market is buck-passing; if markets ‘elicit information’ or ‘incentivize performance’, how is that information learned and expressed, and where do the actual actions which yield higher performance come from? At some point, someone has to do some real thinking. (A company can outsource its janitors to the free market, but then whatever contractor is hired still has to decide exactly when and where and how to do the janitor-ing; safe to say, it does not hold an internal auction among its janitors to divide up responsibilities and set their schedules.)

The paradox is that free markets appear to depend on entities which are internally run as totalitarian command dictatorships. One might wonder why there is such a thing as a firm, instead of everything being accomplished by exchanges among the most atomic unit (currently) possible, individual humans. Coase’s suggestion is that it is a principal-agent problem: there’s risk, negotiation costs, trade secrets, betrayal, and having a difference between the principal and agent at all can be too expensive & have too much overhead.

Asymptotics Ascendant

An alternative perspective comes from the socialist calculation debate: why have a market at all, with all its waste and competition, if a central planner can work out optimal allocations and simply decree them? Cosma Shalizi, in a review1 of Spufford’s Red Plenty (which draws on Planning Problems in the USSR: The Contribution of Mathematical Economics to their Solution 1960–1971, ed Ellman 1973), discusses the history of linear programming, which was also developed in Soviet Russia under Kantorovich and used for economic planning. One irony (which Shalizi ascribes to Stiglitz) is that under the same theoretical conditions in which markets could lead to an optimal outcome, so too could a linear optimization algorithm. In practice, of course, the Soviet economy couldn’t possibly be run that way, because it would require optimizing over millions or billions of variables, requiring unfathomable amounts of computing power.
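
To make the planner’s problem concrete, the optimal-allocation question is just a linear program. Here is a toy sketch (all products, resources, and numbers invented for illustration), solving a 2-variable allocation by brute-force vertex enumeration, exploiting the fact that a bounded LP’s optimum must sit at a corner of the feasible region:

```python
import itertools
import numpy as np

# Invented planning problem: two products, maximize profit 3*x1 + 5*x2
# subject to resource constraints (machine A, machine B, labor) and x >= 0.
A = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 2.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([4.0, 12.0, 18.0, 0.0, 0.0])
profit = np.array([3.0, 5.0])

# An optimum of a bounded linear program lies at a vertex of the feasible
# polytope, so at toy scale we can enumerate all constraint intersections.
best, best_x = -np.inf, None
for i, j in itertools.combinations(range(len(A)), 2):
    M = A[[i, j]]
    if abs(np.linalg.det(M)) < 1e-9:
        continue                        # parallel constraints: no vertex
    x = np.linalg.solve(M, b[[i, j]])
    if np.all(A @ x <= b + 1e-9) and profit @ x > best:
        best, best_x = profit @ x, x

print(best_x, best)   # x = (2, 6), profit = 36
```

A real planner faces the same structure with millions of variables, where vertex enumeration is hopeless and one needs simplex or interior-point solvers; that combinatorial blowup is exactly the computational obstacle Shalizi describes.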

Optimization Obtained

As it happens, we now have unfathomable amounts of computing power. What was once a reductio ad absurdum is now just a modus ponens.

Corporations, and tech companies in particular as the leading edge, routinely solve planning problems for logistics like fleets of cars or datacenter optimization involving millions of variables; similar SAT solvers are ubiquitous in computer security research for modeling large computer codebases to verify safety or discover vulnerabilities; most robots couldn’t operate without constantly solving & optimizing enormous systems of equations. The internal planned ‘economies’ of tech companies have grown kudzu-like, sprouting ever larger datasets to predict and automated analyses to plan and to control. The problems solved by retailers like Walmart or Target are world-sized.2 (‘“We are not setting the price. The market is setting the price”, he says. “We have algorithms to determine what that market is.”’) The motto of a Google or Amazon or Uber might be (to paraphrase Freeman Dyson’s paraphrase of John von Neumann in Infinite in All Directions, 1988): “All processes that are stable we shall plan. All processes that are unstable we shall compete in (for now).” Companies may use some limited internal ‘markets’ as useful metaphors for allocation, and dabble in prediction markets, but the internal dynamics of tech companies bear little resemblance to competitive free markets, and show little sign of moving in market-ward directions.

The march of planning also shows little sign of stopping. Uber is not going to stop using historical forecasts of demand to move drivers around to meet expected demand and optimize trip trajectories; datacenters will not stop using linear solvers to allocate running jobs to machines in an optimal manner to minimize electricity consumption while balancing against latency and throughput, in search of a virtuous cycle culminating in the optimal route, “the perpetual trip, the trip that never ends”; ‘markets’ like smartphone walled gardens rely ever more each year on algorithms parsing human reviews & binaries & clicks to decide how to rank or push advertising and conduct multi-armed bandit exploration of options; and so on endlessly.

So, can we run an economy with scaled-up planning approaching 100% centralization, while increasing efficiency and even outcompeting free capitalism-style competitive markets, as Cockshott & Cottrell propose (a proposal occasionally revived in pop socialism like The People’s Republic of Walmart: How the World’s Biggest Corporations are Laying the Foundation for Socialism)?


Let’s look at some more examples:

  1. corporations and growth
  2. humans, brains, and cells
  3. meta-learning in AI (particularly RL)

Artificial Persons

The striking thing about corporations improving is that they don’t: corporations don’t evolve (see the Price equation & multilevel selection, which can be applied to corporations). The business world would look completely different if they did! Despite large differences in competency between corporations, the best corporations don’t simply ‘clone’ themselves and regularly take over arbitrary industries with their superior skills, only to eventually succumb to their mutant offspring who have become even more efficient.

We can copy the best software algorithms, like AlphaZero, indefinitely and they will perform as well as the original, and we can tweak them in various ways to make them steadily better (and this is in fact how many algorithms are developed, by constant iteration); species can reproduce themselves, steadily evolving to ever better exploit their niches, not to mention the power of selective breeding programs; individual humans can refine teaching methods and transmit competence (calculus used to be reserved for the most skilled mathematicians, and now is taught to ordinary high school students, and chess grandmasters have become steadily younger with better & more intensive teaching methods like chess engines); we could even clone exceptional individuals to get more similarly talented individuals, if we really wanted to. But we don’t see this happen with corporations. Instead, despite desperate struggles to maintain “corporate culture”, companies typically coast along, getting more and more sluggish, failing to spin off smaller companies as lean & mean as they used to be, until conditions change or random shocks or degradation finally do them in, such as perhaps some completely-unrelated company (sometimes founded by a complete outsider like a college student) eating their lunch.

Why do we not see exceptional corporations clone themselves and take over all market segments? Why don’t corporations evolve such that all corporations or businesses are now the hyper-efficient descendants of a single ur-corporation 50 years ago, all other corporations having gone extinct in bankruptcy or been acquired? Why is it so hard for corporations to keep their “culture” intact and retain their youthful lean efficiency, or, if avoiding ‘aging’ is impossible, why not copy themselves or otherwise reproduce to create new corporations like themselves? Instead, successful large corporations coast on inertia or market failures like regulatory capture/monopoly, while successful small ones worry endlessly about how to preserve their ‘culture’ or how to ‘stay hungry’ or find a replacement for the founder as they grow, and there is constant turnover. The large corporations function just well enough that maintaining their existence is an achievement3.

Evolution & the Price equation require 3 things: entities which can replicate themselves; variation of entities; and selection on entities. Corporations have variation, and they have selection—but they don’t have replication.
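
In symbols, the standard Price equation makes the role of replication explicit: for entities indexed by $i$ with fitness $w_i$ and trait value $z_i$, the change in the average trait across one ‘generation’ decomposes as

$$\Delta\bar{z} \;=\; \frac{\operatorname{Cov}(w_i, z_i)}{\bar{w}} \;+\; \frac{\operatorname{E}(w_i\,\Delta z_i)}{\bar{w}}$$

The first (selection) term rewards covariance between trait and fitness, but the second (transmission) term measures how much entities change between ‘parent’ and ‘offspring’; without faithful replication, the transmission term is large and unsystematic, and swamps whatever the selection term gains.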

Corporations certainly undergo selection for kinds of fitness, and do vary a lot. The problem seems to be that corporations cannot replicate themselves. They can set up new corporations, yes, but that’s not necessarily replicating themselves—they cannot clone themselves the way a bacterium can. When a bacterium clones itself, it has… a clone, which is difficult to distinguish in any way from the ‘original’. In sexual organisms, children still resemble their parents to a great extent. But when a large corporation spins off a division or starts a new one, the result may be nothing like the parent and completely lack any secret sauce. A new acquisition will retain its original character and efficiencies (if any). A corporation satisfies the Peter Principle by eventually growing to its level of incompetence, which is always much smaller than ‘the entire economy’. Corporations are made of people, not interchangeable easily-copied widgets or strands of DNA. There is no ‘corporate DNA’ which can be copied to create a new corporation just like the old. The corporation may not even be able to ‘replicate’ itself over time, leading to scleroticism and aging—but this then leads to underperformance and eventually selection against it, one way or another. So, an average corporation appears little more efficient, particularly if we exclude any gains from new technologies, than an average corporation 50 years ago, and the challenges and failures of the rare multinational corporation 500 years ago like the Medici bank look strikingly similar to the challenges and failures of banks today.

We can see a similar problem with other large-scale human organizations: ‘cultures’. An idea seen sometimes is that cultures undergo selection & evolution, and as such, are made up of adaptive beliefs/practices/institutions which no individual understands (such as farming practices optimally tailored to local conditions); even apparently highly irrational & wasteful traditional practices may actually be an adaptive evolved response, which is optimal in some sense we as yet do not appreciate (sometimes linked to “Chesterton’s fence” as an argument for status quo-ism).

This is not a ridiculous position, since occasionally certain traditional practices have been vindicated by scientific investigation, but the lens of multilevel selection as defined by the Price equation shows there are serious quantitative issues with this: cultures or groups are rarely driven extinct, with most large-scale ones persisting for millennia; such ‘natural selection’ on the group level is only tenuously linked to the many thousands of distinct practices & beliefs that make up these cultures; and these cultures mutate rapidly as fads and visions and stories and neighboring cultures and new technologies all change over time (compare the consistency of folk magic/medicine over even small geographic regions, or in the same place over several centuries). For most things, ‘traditional culture’ is simply flat-out wrong and harmful: all forms are mutually contradictory, not verified by science, and contain no useful information, and—contrary to “Chesterton’s fence”—the older and harder it is to find a rational basis for a practice, the less likely it is to be helpful:

Chesterton’s meta-fence: “in our current system (democratic market economies with large governments) the common practice of taking down Chesterton fences is a process which seems well established and has a decent track record, and should not be unduly interfered with (unless you fully understand it)”.

The existence of many erroneous practices, and the successful diffusion of erroneous ones, is acknowledged by proponents of cultural evolution like Henrich (eg Henrich provides several examples which are comparable to spreading harmful mutations), so the question here is one of emphasis or quantity: is the glass 1% full or 99% empty? It’s worth recalling the conditions for human expertise (Armstrong 2001, Principles of Forecasting; Tetlock 2005, Expert Political Judgment: How Good Is It? How Can We Know?; ed Ericsson 2006, The Cambridge Handbook of Expertise and Expert Performance; Kahneman & Klein 2009): repeated practice with quick feedback on objective outcomes in unchanging environments; these conditions are satisfied for relatively few human activities, which are more often rare, with long-delayed feedback, left to quite subjective appraisals mixed in with enormous amounts of randomness & consequences of many other choices before/after, and subject to potentially rapid change (and the more so the more people are able to learn). In such environments, people are more likely to fail to build expertise, be fooled by randomness, and construct elaborate yet erroneous theoretical edifices of superstition (like Tetlock’s hedgehogs). Evolution is no fairy dust which can overcome these serious inferential problems, which are why reinforcement learning is so hard.4

For agriculture, with regular feedback, results which are enormously important to both individual and group survival, and relatively straightforward mechanistic cause-and-effect relationships, it is not surprising that practices tend to be somewhat optimized (although still far from optimal, as enormously increased yields in the Industrial Revolution demonstrate, in part by avoiding the errors of traditional agriculture)5; but none of that applies to ‘traditional medicine’, dealing as it does with complex self-selection, regression to the mean, and placebo effects, where aside from the simplest cases like setting broken bones (again, straightforward, with a clear cause-and-effect relationship), hardly any of it works6 and one is lucky if a traditional remedy is merely ineffective rather than outright poisonous, and in the hardest cases like snake bites, it would be better to wait for death at home than waste time going to the local witch doctor.

So—just like corporations—‘selection’ of cultures happens rarely, with each ‘generation’ spanning centuries or millennia, typically has little to do with how reality-based their beliefs tend to be (for a selection coefficient approaching zero), and if one culture did in fact consume another one thanks to more useful beliefs about some herb, it is likely to backslide under the bombardment of memetic mutation (so any selection is spent just purging mutations, creating a mutation-selection balance); under such conditions, there will be little long-term ‘evolution’ towards higher optima, and the information content of culture will be minimal and closely constrained to only the most universal, high-fitness-impact, and memetically-robust aspects.
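
The standard population-genetics result makes this pessimism quantitative: a deleterious variant reintroduced by mutation at rate $\mu$ per generation, and removed by selection with coefficient $s$, settles at an equilibrium frequency of roughly

$$\hat{q} \approx \frac{\mu}{s}$$

so with memetically high mutation rates $\mu$ and a cultural selection coefficient $s$ approaching zero, erroneous practices persist at high frequency no matter how long selection operates.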

Natural Persons

“Individual organisms are best thought of as adaptation-executers rather than as fitness-maximizers. Natural selection cannot directly ‘see’ an individual organism in a specific situation and cause behavior to be adaptively tailored to the functional requirements imposed by that situation.”

Tooby & Cosmides 1992, “The Psychological Foundations of Culture”

“Good ideology. Wrong species.”

Edward O. Wilson, of Marxism

Contrast that with a human. Despite ultimately being designed by evolution, evolution then plays no role at ‘runtime’, and more powerful learning algorithms take over.

With these more powerful algorithms designed by the meta-algorithm of evolution, a human is able to live successfully for over 100 years, with tremendous cooperation between the trillions of cells in their body, only rarely breaking down towards the end, with a small handful of seed cancer cells defecting over a lifetime despite even more trillions of cell divisions and replacements. They are also able to be cloned, yielding identical twins so similar across the board that people who know them may be unable to distinguish them. And they don’t need to use evolution or markets to develop these bodies; instead, they rely on a complex hardwired developmental program controlled by genes which ensures that >99% of humans get the paired eyes, lungs, legs, brain hemispheres, etc. that they need. Perhaps the most striking efficiency gain in a human is the possession of a brain with the ability to predict the future, learn highly abstract models of the world, and plan and optimize over these plans for objectives which may only relate indirectly to fitness decades from now, or fitness-related events which happen less than once in a lifetime & are usually unobserved, or fitness events like those of descendants which can never be observed.


Black Box vs White Box Optimization

Let’s put it another way.

Imagine trying to run a business in which the only feedback given is whether you go bankrupt or not. In running that business, you make millions or billions of decisions, to adopt a particular model, rent a particular store, advertise this or that, hire one person out of scores of applicants, assign them this or that task to make many decisions of their own (which may in turn require decisions to be made by still others), and so on, extended over many years. At the end, you turn a healthy profit, or go bankrupt. So you get 1 bit of feedback, which must be split over billions of decisions. When a company goes bankrupt, what killed it? Hiring the wrong accountant? The CEO not investing enough in R&D? Random geopolitical events? New government regulations? Putting its HQ in the wrong city? Just a generalized inefficiency? How would you know which decisions were good and which were bad? How do you solve the “credit assignment problem”?

Ideally, you would have some way of tracing every change in the financial health of a company back to the original decision & the algorithm which made that decision, but of course this is impossible, since there is no way to know who said or did what or even who discussed what with whom when. There would seem to be no general approach other than the truly brute-force one of evolution: over many companies, have some act one way and some act another way, and on average, good decisions will cluster in the survivors and not-so-good decisions will cluster in the deceased. ‘Learning’ here works (under certain conditions—like sufficiently reliable replication—which in practice may not obtain) but is horrifically expensive & slow.

In RL, this would correspond to black-box/gradient-free methods, particularly evolutionary methods. For example, Salimans et al 2017 uses an evolutionary method in which thousands of slightly-randomized neural networks play an Atari game simultaneously, and at the end of the games, a new average neural network is defined based on the performance of them all; no attempt is made to figure out which specific changes are good or bad or even to get a reliable estimate—they simply run and the scores are what they are. If we imagine a schematic like ‘models → model parameters → environments → decisions → outcomes’, evolution collapses it to just ‘models → outcomes’; feed a bunch of possible models in, get back outcomes, pick the models with the best outcomes.
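
A minimal sketch of such an evolution strategy, with a 2-parameter ‘network’ and a stand-in quadratic fitness function instead of an Atari game (all constants invented): perturb the current parameters many times, score each variant, and move toward the outcome-weighted average of the perturbations, with no per-decision credit assignment at all.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(theta):
    # Stand-in for 'play one game with these weights': the true optimum
    # (unknown to the algorithm) is theta = (3, -2).
    return -np.sum((theta - np.array([3.0, -2.0])) ** 2)

theta = np.zeros(2)                 # the current 'average' network
sigma, alpha, n = 0.1, 0.02, 200    # noise scale, step size, population size

for generation in range(300):
    eps = rng.standard_normal((n, theta.size))       # n randomized variants
    scores = np.array([fitness(theta + sigma * e) for e in eps])
    advantages = scores - scores.mean()              # baseline-subtracted
    # 'models -> outcomes': average the perturbations, weighted by outcome.
    theta = theta + alpha / (n * sigma) * eps.T @ advantages

print(theta)   # converges toward (3, -2)
```

Note that `fitness` stays a pure black box: the update never asks which parameter change helped, only which whole variants scored well.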

A more sample-efficient method would be something like REINFORCE, which Andrej Karpathy explains with an ALE Pong agent; what does REINFORCE do to crack the black box open a little bit? It’s still horrific and amazing that it works:

So here is how the training will work in detail. We will initialize the policy network with some W1, W2 and play 100 games of Pong (we call these policy “rollouts”). Lets assume that each game is made up of 200 frames so in total we’ve made 20,000 decisions for going UP or DOWN and for each one of these we know the parameter gradient, which tells us how we should change the parameters if we wanted to encourage that decision in that state in the future. All that remains now is to label every decision we’ve made as good or bad. For example suppose we won 12 games and lost 88. We’ll take all decisions we made in the winning games and do a positive update (filling in a +1.0 in the gradient for the sampled action, doing backprop, and parameter update encouraging the actions we picked in all those states). And we’ll take the other decisions we made in the losing games and do a negative update (discouraging whatever we did). And… that’s it. The network will now become slightly more likely to repeat actions that worked, and slightly less likely to repeat actions that didn’t work. Now we play another 100 games with our new, slightly improved policy and rinse and repeat.

Policy Gradients: Run a policy for a while. See what actions led to high rewards. Increase their probability.

If you think through this process you’ll start to find a few funny properties. For example what if we made a good action in frame 50 (bouncing the ball back correctly), but then missed the ball in frame 150? If every single action is now labeled as bad (because we lost), wouldn’t that discourage the correct bounce on frame 50? You’re right—it would. However, when you consider the process over thousands/millions of games, then doing the first bounce correctly makes you slightly more likely to win down the road, so on average you’ll see more positive than negative updates for the correct bounce and your policy will end up doing the right thing.

…I did not tune the hyperparameters too much and ran the experiment on my (slow) Macbook, but after training for 3 nights I ended up with a policy that is slightly better than the AI player. The total number of episodes was approximately 8,000 so the algorithm played roughly 200,000 Pong games (quite a lot isn’t it!) and made a total of ~800 updates.

The difference here from evolution is that the credit assignment is able to use backpropagation to reach into the NN and directly adjust each parameter’s contribution to the decision which was ‘good’ or ‘bad’; the difficulty of tracing out the consequences of each decision and labeling it ‘good’ is simply bypassed with the brute-force approach of decreeing that all actions taken in an ultimately-successful game were good, and all of them were bad if the game was ultimately lost. Here we optimize something more like ‘model parameters → decisions → outcomes’; we feed parameters in to get out decisions which then are assumed to cause the outcome, and reverse it to pick the parameters with the best outcomes.
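
Karpathy’s procedure can be shrunk to a toy version (a single sigmoid ‘policy’ parameter, and an invented 10-decision ‘game’ which is won iff UP is chosen at least 6 times): label every decision in a game with the game’s final outcome, and nudge the parameter accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

theta = 0.0   # single policy parameter; sigmoid(theta) = P(choose UP)
lr = 0.05

for game in range(2000):
    p = sigmoid(theta)
    actions = (rng.random(10) < p).astype(float)  # 10 UP(1)/DOWN(0) decisions
    # Invented environment: the 'game' is won iff UP was chosen >= 6 times.
    reward = 1.0 if actions.sum() >= 6 else -1.0
    # REINFORCE: every decision inherits the final outcome as its label;
    # the gradient of the summed log-probabilities w.r.t. theta is sum(a - p).
    grad_logp = np.sum(actions - p)
    theta += lr * reward * grad_logp

print(sigmoid(theta))   # the learned P(UP) has risen toward 1
```

Even though individual UP choices in losing games get (wrongly) discouraged along with everything else, the average update still points the right way, exactly as Karpathy argues.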

This is still crazy, but it works, and better than simple-minded evolution: Salimans et al 2017 compares their evolution method to more standard methods which are fancier versions of the REINFORCE policy gradient approach, and this brutally limited use of backpropagation for credit assignment still cuts the sample size by 3–10x, and more on more difficult problems.

Can we do better? Of course. It is absurd to claim that all actions in a game determine the final outcome, since the environment itself is stochastic and many decisions are either irrelevant or were the opposite in true quality of whatever the outcome was. To do better, we can connect the decisions to the environment by modeling the environment itself as a white box which can be cracked open & analyzed, using a model-based RL approach like the well-known PILCO.

In PILCO, a model of the environment is learned by a powerful model (the non-neural-network Gaussian process, in this case), and the model is used to do planning: start with a series of possible actions, run them through the model to predict what would happen, and directly optimize the actions to maximize the reward. The influence of the parameters of the model causing the chosen actions, which then partially cause the environment, which then partially causes the reward, can all be traced from the final reward back to the original parameters. (It’s white boxes all the way down.) Here the full ‘models → model parameters → environments → decisions → outcomes’ pipeline is expressed and the credit assignment is performed correctly & as a whole.
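
A drastically simplified sketch of the model-based idea (a 1-D point mass, and a linear least-squares model standing in for PILCO’s Gaussian process; everything here is invented for illustration): learn a dynamics model from a few random transitions, then optimize a whole action sequence by differentiating the final reward through the learned model.

```python
import numpy as np

rng = np.random.default_rng(0)

# The true environment (unknown to the agent): a 1-D point moved by 0.1*action.
def step(x, a):
    return x + 0.1 * a

# 1. Model learning: fit x' = x + w*a by least squares on random transitions
#    (a linear model standing in for PILCO's Gaussian process).
xs = rng.uniform(-1, 1, 50)
acts = rng.uniform(-1, 1, 50)
nxt = step(xs, acts)
w = np.sum((nxt - xs) * acts) / np.sum(acts ** 2)

# 2. Planning: optimize a 5-step action plan by gradient ascent on the reward
#    R = -(x_final)^2, differentiated THROUGH the learned model, so credit
#    flows from the final reward back to every individual action.
x0 = 1.0
actions = np.zeros(5)
for _ in range(200):
    x = x0
    for a in actions:                    # roll out the learned model
        x = x + w * a
    # dR/da_i = -2 * x_final * w for every action in the linear plan
    actions += 0.5 * (-2 * x * w) * np.ones(5)

# 3. Execute the plan in the real environment.
x = x0
for a in actions:
    x = step(x, a)
print(x)   # ends up near 0
```

Because the credit assignment is exact, a handful of random transitions suffices to plan well, mirroring PILCO’s extreme sample-efficiency on problems like Cartpole.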

The result is state-of-the-art sample efficiency: on a simple problem like Cartpole, PILCO can solve it within as little as 10 episodes, while standard deep reinforcement learning approaches like policy gradients can struggle to solve it within 10,000 episodes.

The problem, of course, with model-based RL such as PILCO is that what it gains in correctness & sample-efficiency, it gives back in computational requirements: I can’t compare PILCO’s sample-efficiency with Salimans et al 2017’s ALE sample-efficiency or even Karpathy’s Pong sample-efficiency because PILCO simply can’t be run on problems all that much more complex than Cartpole.

So we have a painful dilemma: sample-efficiency can be many orders of magnitude greater than possible with evolution, if only one could do more precise fine-grained credit assignment—instead of judging billions of decisions based solely on a single distant noisy binary outcome, the algorithm generating each decision can be traced through all of its ramifications through all subsequent decisions & outcomes to a final reward—but these better methods are not directly applicable. What to do?

Going Meta

“…the spacing that has made for the most successful inductions will have tended to predominate through natural selection. Creatures inveterately wrong in their inductions have a pathetic but praiseworthy tendency to die before reproducing their kind….In induction nothing succeeds like success.”

W. V. O. Quine, “Natural Kinds” 1969

Speaking of evolutionary algorithms & sample-efficiency, an interesting area of AI and reinforcement learning is “meta-learning”, usually described as “learning to learn”. This rewrites a given learning task as a two-level problem, where one seeks a meta-algorithm for a family of problems which then adapts at runtime to the specific problem at hand. (In evolutionary terms, this could be seen as related to the Baldwin effect.) There are many paradigms in meta-learning using various kinds of learning & optimizers; for a listing of several recent ones, see Table 1 of (reproduced in an appendix).

For ex­am­ple, one could train an RNN on a ‘left or right’ T-maze task where the di­rec­tion with the re­ward switches at ran­dom every once in a while: the RNN has a mem­o­ry, its hid­den state, so after try­ing the left arm a few times and ob­serv­ing no re­ward, it can en­code “the re­ward has switched to the right”, and then de­cide to go right every time while con­tin­u­ing to en­code how many fail­ures it’s had after the switch; when the re­ward then switches back to the left, after a few fail­ures on the right, the learned rule will fire and it’ll switch back to the left. With­out this se­quen­tial learn­ing, if it was just trained on a bunch of sam­ples, where half the ‘lefts’ have a re­ward and half the ‘rights’ also have a re­ward (be­cause of the con­stant switch­ing), it’ll learn a bad strat­egy like pick­ing a ran­dom choice 50-50, or al­ways go­ing left/right. An­other ap­proach is ‘fast weights’, where a start­ing meta-NN ob­serves a few dat­a­points from a new prob­lem, and then emits the ad­justed pa­ra­me­ters for a new NN, spe­cial­ized to the prob­lem, which is then run ex­actly and re­ceives a re­ward, so the meta-NN can learn to emit ad­justed pa­ra­me­ters which will achieve high re­ward on all prob­lems. A ver­sion of this might be the MAML meta-learn­ing al­go­rithms () where a meta-NN is learned which is care­fully bal­anced be­tween pos­si­ble NNs so that a few fine­tun­ing steps of gra­di­ent de­scent train­ing within a new prob­lem ‘spe­cial­izes’ it to that prob­lem (one might think of the meta-NN as be­ing a point in the high­-di­men­sional model space which is roughly equidis­tant from a large num­ber of NNs trained on each in­di­vid­ual prob­lem, where tweak­ing a few pa­ra­me­ters con­trols over­all be­hav­ior and only those need to be learned from the ini­tial ex­pe­ri­ences). 
In gen­er­al, meta-learn­ing en­ables learn­ing of the su­pe­rior Bayes-op­ti­mal agent within en­vi­ron­ments by in­effi­cient (pos­si­bly not even Bayesian) train­ing across en­vi­ron­ments (). As Duff 2002 puts it, “One way of think­ing about the com­pu­ta­tional pro­ce­dures that I later pro­pose is that they per­form an offline com­pu­ta­tion of an on­line, adap­tive ma­chine. One may re­gard the process of ap­prox­i­mat­ing an op­ti­mal pol­icy for the Markov de­ci­sion process de­fined over hy­per­-s­tates as ‘com­pil­ing’ an op­ti­mal learn­ing strat­e­gy, which can then be ‘loaded’ into an agent.”
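The T-maze story can be reduced to a toy simulation (the task parameters and the “switch after one failure” rule below are invented for illustration, not taken from any paper): a memoryless policy trained on aggregate frequencies can do no better than chance, while an agent whose hidden state encodes “my last choice failed” tracks the moving reward almost perfectly.

```python
import random

def switching_bandit(T=1000, period=50, seed=0):
    """Two-armed task where the rewarded arm switches sides every `period` steps."""
    rng = random.Random(seed)
    rewarded = 0                      # 0 = left arm, 1 = right arm
    total_static = total_adaptive = 0
    adaptive_choice = 0               # meta-learned rule: stay, switch after a failure
    for t in range(T):
        if t > 0 and t % period == 0:
            rewarded = 1 - rewarded   # the reward silently switches sides
        # Memoryless policy: aggregate data says each arm pays off half the
        # time, so the best it can do is pick at random.
        static_choice = 0 if rng.random() < 0.5 else 1
        total_static += int(static_choice == rewarded)
        # Adaptive policy: one bit of memory implements the learned rule.
        r = int(adaptive_choice == rewarded)
        total_adaptive += r
        if r == 0:
            adaptive_choice = 1 - adaptive_choice
    return total_static, total_adaptive
```

With these settings the adaptive agent loses exactly one step per switch (981/1000 here), while the memoryless agent hovers near 500.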

An interesting example of this approach is the DeepMind paper , which presents a Quake team FPS agent trained using a two-level approach (and which extends it further with multiple populations; for background, see Sutton & Barto 2018; for an evolutionary manifesto, see ), an approach which was valuable for their AlphaStar StarCraft II agent publicized in January 2019. The FPS game is a multiplayer capture-the-flag match where teams compete on a map, rather than the agent controlling a single agent in a death-match setting; learning to coordinate, as well as explicitly communicate, with multiple copies of oneself is tricky, and normal training methods don’t work well because updates change all the other copies of oneself as well and destabilize any communication protocols which have been learned. What Jaderberg et al do is use normal deep RL techniques within each agent, predicting and receiving rewards within each game based on earning points for flags/attacks, but then the overall population of 30 agents, after each set of matches, undergoes a second level of selection based on final game score/victory, which selects on each agent’s internal reward predictions & hyperparameters:

This can be seen as a two-tier reinforcement learning problem. The inner optimisation maximises J_inner, the agents’ expected future discounted internal rewards. The outer optimisation of J_outer can be viewed as a meta-game, in which the meta-reward of winning the match is maximised with respect to internal reward schemes w_p and hyperparameters φ_p, with the inner optimisation providing the meta transition dynamics. We solve the inner optimisation with RL as previously described, and the outer optimisation with population-based training (PBT). PBT is an online evolutionary process which adapts internal rewards and hyperparameters and performs model selection by replacing under-performing agents with mutated versions of better agents. This joint optimisation of the agent policy using RL together with the optimisation of the RL procedure itself towards a high-level goal proves to be an effective and generally applicable strategy, and utilises the potential of combining learning and evolution (2) in large scale learning systems.

The goal is to win, the ground-truth re­ward is the win/loss, but learn­ing only from win/loss is ex­tremely slow: a sin­gle bit (prob­a­bly less) of in­for­ma­tion must be split over all ac­tions taken by all agents in the game and used to train NNs with mil­lions of in­ter­de­pen­dent pa­ra­me­ters, in a par­tic­u­larly in­effi­cient way as one can­not com­pute ex­act gra­di­ents from the win/loss back to the re­spon­si­ble neu­rons. With­in-game points are a much richer form of su­per­vi­sion, more nu­mer­ous and cor­re­spond­ing to short time seg­ments, al­low­ing for much more learn­ing within each game (pos­si­bly us­ing ex­act gra­di­ents), but are only in­di­rectly re­lated to the fi­nal win/loss; an agent could rack up many points on its own while ne­glect­ing to fight the en­emy or co­or­di­nate well and en­sur­ing a fi­nal de­feat, or it could learn a greedy team strat­egy which per­forms well ini­tially but loses over the long run. So the two-tier prob­lem uses the slow ‘outer’ sig­nal or loss func­tion (win­ning) to sculpt the faster in­ner loss which does the bulk of the learn­ing. (“Or­gan­isms are adap­ta­tion-ex­ecu­tors, not fit­ness-max­i­miz­ers.”) Should the fast in­ner al­go­rithms not be learn­ing some­thing use­ful or go hay­wire or fall for a trap, the outer re­wards will even­tu­ally re­cover from the mis­take, by mu­tat­ing or aban­don­ing them in fa­vor of more suc­cess­ful lin­eages. This com­bines the crude, slow, dogged op­ti­miza­tion of evo­lu­tion, with the much faster, more clev­er, but po­ten­tially mis­guided gra­di­en­t-based op­ti­miza­tion, to pro­duce some­thing which will reach the right goal faster. (Two more re­cent ex­am­ples would be /.)
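This two-tier structure can be caricatured in a few lines (everything below, from the internal-reward parameterization to the fitness function and population size, is a made-up stand-in for the real system): an inner loop greedily optimizes a dense internal reward whose shape is controlled by a hyperparameter, while an outer evolutionary loop scores agents only by the sparse ground-truth outcome and propagates the internal rewards that led to winning.

```python
import random

def pbt(generations=40, pop=20, sigma=0.1, seed=0):
    """Toy population-based training: evolve each agent's *internal* reward
    so that optimizing it happens to maximize the *external* ground truth."""
    rng = random.Random(seed)
    ws = [0.1] * pop  # internal-reward hyperparameter, badly initialized

    def inner_optimum(w):
        # Inner loop (here solved in closed form): maximize the dense internal
        # reward w*x - x**2/2 over the behavior x; the optimum is x = w.
        return w

    def outer_fitness(x):
        # Sparse ground truth: matches are won near x = 0.7, which the
        # inner loss knows nothing about.
        return 1.0 - (x - 0.7) ** 2

    for _ in range(generations):
        ranked = sorted(ws, key=lambda w: outer_fitness(inner_optimum(w)),
                        reverse=True)
        top = ranked[: pop // 2]
        # Exploit & explore: the bottom half is replaced by mutated copies
        # of the top half; the top half survives unchanged (elitism).
        ws = top + [min(1.0, max(0.0, w + rng.gauss(0, sigma))) for w in top]
    return max(outer_fitness(inner_optimum(w)) for w in ws)
```

Despite every agent starting with a badly misspecified internal reward, selection on the outer loss alone drags the population’s internal rewards toward ones whose inner optimum wins matches.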

Two-Level Meta-Learning

Cosma Shalizi, elsewhere, enjoys noting formal identities between natural selection and Bayesian statistics (especially ) and markets, where the population frequency of an allele corresponds to a parameter’s prior probability or the starting wealth of a trader, and fitness differentials/profits correspond to updates based on new evidence. (See also Evstigneev et al 2008/Lensberg & Schenk-Hoppé 2006, , .) While a parameter may start with an erroneously low prior, at some point the updates will make the posterior converge on it. (The relationship between populations of individuals with noisy fixed beliefs, and , is also interesting. Can we see the apparently-inefficient stream of startups trying ‘failed’ ideas—and occasionally winding up winning big—as a kind of collective Thompson sampling, & more efficient than it seems?) And SGD can be seen as secretly an approximation or form of Bayesian updating by estimating its gradients (because everything that works works because it’s Bayesian?), and of course evolutionary methods can be seen as calculating approximations to gradients…

Analogies between different optimization/inference models:

| Model              | Parameter | Prior                 | Update               |
|--------------------|-----------|-----------------------|----------------------|
| Evolution          | Allele    | Population frequency  | Fitness differential |
| Market             | Trader    | Starting wealth       | Profit               |
| Particle filtering | Particle  | Population frequency  | Accept-reject sample |
| SGD                | Parameter | Random initialization | Gradient step        |
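The first correspondence can be made exact: one generation of discrete replicator dynamics is, term for term, Bayes’ rule, with relative fitnesses playing the role of likelihoods. A minimal check (the numbers are arbitrary):

```python
def replicator_step(freqs, fitnesses):
    """One generation of discrete replicator dynamics: p'_i = p_i * w_i / mean(w)."""
    mean_fitness = sum(p * w for p, w in zip(freqs, fitnesses))
    return [p * w / mean_fitness for p, w in zip(freqs, fitnesses)]

def bayes_update(priors, likelihoods):
    """Bayes' rule: posterior_i = prior_i * likelihood_i / evidence."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

p = [0.6, 0.3, 0.1]   # allele frequencies, read as prior probabilities
w = [0.9, 1.0, 1.3]   # relative fitnesses, read as likelihoods
assert replicator_step(p, w) == bayes_update(p, w)  # identical arithmetic

# "Erroneously low prior": the fittest allele starts rare but takes over,
# just as accumulating updates eventually overwhelm a bad prior.
for _ in range(50):
    p = replicator_step(p, w)
assert p[2] > 0.99
```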

This pattern surfaces in our other examples too. This two-level learning is analogous to meta-learning: the outer or meta-algorithm learns how to generate an inner or object-level algorithm which can learn most effectively, better than the meta-algorithm. Inner algorithms can themselves learn better algorithms, and so on, gaining power, compute-efficiency, or sample-efficiency with every level of specialization. (“It’s optimizers all the way up, young man!”) It’s also analogous to cells in a human body: overall reproductive fitness is a slow signal that occurs only a few times in a lifetime at most, but over many generations, it builds up fast-reacting developmental and homeostatic processes which can build an efficient and capable body and respond to environmental fluctuations within minutes rather than millennia, with the brain faster still for split-second situations. It’s also analogous to corporations in a market: the corporation can use whatever internal algorithms it pleases, such as linear optimization or neural networks, and evaluate them internally using internal metrics like “number of daily users”; but eventually, this must result in profits…

The cen­tral prob­lem a cor­po­ra­tion solves is how to mo­ti­vate, or­ga­nize, pun­ish & re­ward its sub­-u­nits and con­stituent hu­mans in the ab­sence of di­rect end-to-end losses with­out the use of slow ex­ter­nal mar­ket mech­a­nisms. This is done by tap­ping into so­cial mech­a­nisms like peer es­teem (sol­diers don’t fight for their coun­try, they fight for their bud­dies), se­lect­ing work­ers who are in­trin­si­cally mo­ti­vated to work use­fully rather than par­a­sit­i­cal­ly, con­stant at­tempts to in­still a “com­pany cul­ture” with slo­ga­neer­ing or hand­books or com­pany songs, use of mul­ti­ple proxy mea­sures for re­wards to re­duce Good­hart-style re­ward hack­ing, ad hoc mech­a­nisms like stock op­tions to try to in­ter­nal­ize within work­ers the mar­ket loss­es, re­plac­ing work­ers with out­sourc­ing or au­toma­tion, ac­quir­ing smaller com­pa­nies which have not yet de­cayed in­ter­nally or as a se­lec­tion mech­a­nism (“ac­qui­hires”), em­ploy­ing in­tel­lec­tual prop­erty or reg­u­la­tion… All of these tech­niques to­gether can align the parts into some­thing use­ful to even­tu­ally sell…

Man Proposes, God Disposes

…Or else the com­pany will even­tu­ally go bank­rupt:

Great is Bank­ruptcy: the great bot­tom­less gulf into which all False­hoods, pub­lic and pri­vate, do sink, dis­ap­pear­ing; whith­er, from the first ori­gin of them, they were all doomed. For Na­ture is true and not a lie. No lie you can speak or act but it will come, after longer or shorter cir­cu­la­tion, like a Bill drawn on Na­ture’s Re­al­i­ty, and be pre­sented there for pay­men­t,—with the an­swer, No effects. Pity only that it often had so long a cir­cu­la­tion: that the orig­i­nal forger were so sel­dom he who bore the fi­nal smart of it! Lies, and the bur­den of evil they bring, are passed on; shifted from back to back, and from rank to rank; and so land ul­ti­mately on the dumb low­est rank, who with spade and mat­tock, with sore heart and empty wal­let, daily come in con­tact with re­al­i­ty, and can pass the cheat no fur­ther.

…But with a For­tu­na­tus’ Purse in his pock­et, through what length of time might not al­most any False­hood last! Your So­ci­ety, your House­hold, prac­ti­cal or spir­i­tual Arrange­ment, is un­true, un­just, offen­sive to the eye of God and man. Nev­er­the­less its hearth is warm, its larder well re­plen­ished: the in­nu­mer­able Swiss of Heav­en, with a kind of Nat­ural loy­al­ty, gather round it; will prove, by pam­phle­teer­ing, mus­ke­teer­ing, that it is a truth; or if not an un­mixed (un­earth­ly, im­pos­si­ble) Truth, then bet­ter, a whole­somely at­tem­pered one, (as wind is to the shorn lam­b), and works well. Changed out­look, how­ev­er, when purse and larder grow emp­ty! Was your Arrange­ment so true, so ac­cor­dant to Na­ture’s ways, then how, in the name of won­der, has Na­ture, with her in­fi­nite boun­ty, come to leave it fam­ish­ing there? To all men, to all women and all chil­dren, it is now in­du­bitable that your Arrange­ment was false. Ho­n­our to Bank­rupt­cy; ever right­eous on the great scale, though in de­tail it is so cru­el! Un­der all False­hoods it works, un­wea­riedly min­ing. No False­hood, did it rise heav­en-high and cover the world, but Bank­rupt­cy, one day, will sweep it down, and make us free of it.7

A large corporation like Sears may take decades to die (“There is a great deal of ruin in a nation”, Adam Smith observed), but die it does. Corporations do not increase in performance rapidly and consistently the way selective breeding or AI algorithms do, because they cannot replicate themselves as exactly as digital neural networks or biological cells can; but, nevertheless, they are still part of a two-tier process where a ground-truth uncheatable outer loss constrains the internal dynamics to some degree, maintaining a baseline or perhaps modest improvement over time. The plan is “checked”, as Trotsky puts it in criticizing Stalin’s policies like abandoning the NEP, by supply and demand:

If a uni­ver­sal mind ex­ist­ed, of the kind that pro­jected it­self into the sci­en­tific fancy of Laplace—a mind that could reg­is­ter si­mul­ta­ne­ously all the processes of na­ture and so­ci­ety, that could mea­sure the dy­nam­ics of their mo­tion, that could fore­cast the re­sults of their in­ter-re­ac­tion­s—­such a mind, of course, could a pri­ori draw up a fault­less and ex­haus­tive eco­nomic plan, be­gin­ning with the num­ber of acres of wheat down to the last but­ton for a vest. The bu­reau­cracy often imag­ines that just such a mind is at its dis­pos­al; that is why it so eas­ily frees it­self from the con­trol of the mar­ket and of So­viet democ­ra­cy. But, in re­al­i­ty, the bu­reau­cracy errs fright­fully in its es­ti­mate of its spir­i­tual re­sources.

…The in­nu­mer­able liv­ing par­tic­i­pants in the econ­o­my, state and pri­vate, col­lec­tive and in­di­vid­u­al, must serve no­tice of their needs and of their rel­a­tive strength not only through the sta­tis­ti­cal de­ter­mi­na­tions of plan com­mis­sions but by the di­rect pres­sure of sup­ply and de­mand. The plan is checked and, to a con­sid­er­able de­gree, re­al­ized through the mar­ket.

“Pain Is the Only School-Teacher”

Pain is a cu­ri­ous thing. Why do we have painful pain in­stead of just a more neu­tral pain­less pain, when it can back­fire so eas­ily as chronic pain, among other prob­lems? Why do we have pain at all in­stead of reg­u­lar learn­ing processes or ex­pe­ri­enc­ing re­wards as we fol­low plans?

Can we understand pain as another two-level learning process, where a slow but ground-truth outer loss constrains a fast but unreliable inner loss? I would suggest that pain itself is not an outer loss, but that the painfulness of pain, its intrusive motivational aspect, is what makes it an outer loss. There is no logical necessity for pain to be painful, but a painless pain would not be adaptive or practical, because it would too easily let the inner loss lead to damaging behavior.

Taxonomy of Pain

So let’s con­sider the pos­si­bil­i­ties when it comes to pain. There is­n’t just “pain”. There is (at the least):

  • use­less painful pain (chronic pain, ex­er­cise)

  • use­ful painful pain (the nor­mal sort)

  • use­less non­painful non­pain (dead nerves in di­a­betes or lep­rosy; bed­sores and sim­i­lar every­day in­juries are demon­stra­tions that even the most harm­less ac­tiv­i­ties like ‘ly­ing on a bed’ are in fact con­stantly caus­ing dam­age)

  • use­ful non­painful non­pain (a­dren­a­line rushes dur­ing com­bat)

  • use­less non­painful pain (pain asym­bo­lia, where pa­tients maim & kill them­selves);

  • and intermediate cases, like the Marsili family, who have a genetic mutation (Habib et al 2018) which partially damages pain perception. The Marsilis do feel useful painful pain, but only briefly, and they incur substantial bodily damage (broken bones, scars), but avoid the most horrific anecdotes of those with deadened nerves or pain asymbolia.

    Another interesting case is Jo Cameron, who has a different set of mutations to her endocannabinoid system (FAAH & FAAH-OUT): while not as bad as neuropathy, she still exhibits similar symptoms—her father, who may also have been a carrier, died peculiarly; she regularly burns or cuts herself in household chores; she broke her arm roller-skating as a child but didn’t seek treatment; she delayed treatment of a damaged hip and then of a hand damaged by arthritis until almost too late15; she took in foster children who stole her savings; etc. (Biologist Matthew Hill describes the most common FAAH mutation as causing “low levels of anxiety, forgetfulness, a happy-go-lucky demeanor”, and “Since the paper was published, Matthew Hill has heard from half a dozen people with pain insensitivity, and he told me that many of them seemed nuts” compared to Jo Cameron.)

  • but—is there ‘use­ful pain­less pain’ or ‘use­less painful non­pain’?

It turns out there is ‘painless pain’: people experience exactly that, and “reactive dissociation” is the phrase used to describe the effects sometimes seen with analgesics like morphine when administered after pain has begun, where the patient reports, to quote Dennett 1978 (emphasis in original), that “After receiving the analgesic subjects commonly report not that the pain has disappeared or diminished (as with aspirin) but that the pain is as intense as ever though they no longer mind it…if it is administered before the onset of pain…the subjects claim to not feel any pain subsequently (though they are not numb or anesthetized—they have sensation in the relevant parts of their bodies); while if the morphine is administered after the pain has commenced, the subjects report that the pain continues (and continues to be pain), though they no longer mind it…Lobotomized subjects similarly report feeling intense pain but not minding it, and in other ways the manifestations of lobotomy and morphine are similar enough to lead some researchers to describe the action of morphine (and some barbiturates) as ‘reversible pharmacological leucotomy [lobotomy]’.”16

And we can find examples of what appears to be ‘painful nonpain’: a case-study, Ploner et al 1999, describes a German patient whose somatosensory cortices suffered a lesion from a stroke, leading to an inability to feel heat normally on one side of his body, or to feel any localized spots of heat or pain from heat; despite this, when sufficient heat was applied to a single spot on the arm, the patient became increasingly agitated, describing a “clearly unpleasant” feeling associated with his whole arm, while denying that any description involving crawling-skin sensations or words like “slight pain” or “burning” applied.

A ta­ble might help lay out the pos­si­bil­i­ties:

A taxonomy of possible kinds of ‘pain’, split by organismal consequences, motivational effects, and reported subjective (non)experience:

| Utility | Aversiveness | Qualia presence | Examples |
|---------|--------------|-----------------|----------|
| useless | painful      | pain            | chronic pain; exercise? |
| useful  | painful      | pain            | normal pain/injuries |
| useless | nonpainful   | pain            | asymbolia |
| useful  | nonpainful   | pain            | reactive dissociation, lobotomies; exercise? |
| useless | painful      | nonpain         | unconscious processes such as anesthesia awareness; itches or tickles?17 |
| useful  | painful      | nonpain         | cold/heat, as in the somatosensory-cortex lesion case-study |
| useless | nonpainful   | nonpain         | deadened nerves from diseases (diabetes, leprosy), injury, drugs (anesthetics) |
| useful  | nonpainful   | nonpain         | adrenaline rush/accidents/combat |

Pain serves a clear purpose (stopping us from doing things which may cause damage to our bodies), but in an oddly unrelenting way which we cannot disable and which increasingly often backfires on our long-term interests in the form of ‘chronic pain’ and other problems. Why doesn’t pain operate more like a warning, or like hunger or thirst? They interrupt our minds, but like a computer popup dialog, after due consideration of our plans and knowledge, we can generally dismiss them. Pain is the interruption which doesn’t go away, although it did not have to be this way (Morsella 2005):

Theoretically, nervous mechanisms could have evolved to solve the need for this particular kind of interaction otherwise. Apart from automata, which act like humans but have no phenomenal experience, one can imagine a conscious nervous system that operates as humans do but does not suffer any internal strife. In such a system, knowledge guiding skeletomotor action would be isomorphic to, and never at odds with, the nature of the phenomenal state—running across the hot desert sand in order to reach water would actually feel good, because performing the action is deemed adaptive. Why our nervous system does not operate with such harmony is perhaps a question that only evolutionary biology can answer. Certainly one can imagine such integration occurring without anything like phenomenal states, but from the present standpoint, this reflects more one’s powers of imagination than what has occurred in the course of evolutionary history.

Hui Neng’s Flag

In the re­in­force­ment learn­ing con­text, one could ask: does it make a differ­ence whether one has ‘neg­a­tive’ or ‘pos­i­tive’ re­wards? Any re­ward func­tion with both neg­a­tive and pos­i­tive re­wards could be turned into al­l-pos­i­tive re­wards sim­ply by adding a large con­stant. Is that a differ­ence which makes a differ­ence? Or in­stead of max­i­miz­ing pos­i­tive ‘re­wards’, one could speak of min­i­miz­ing ‘losses’, and one often does in eco­nom­ics or de­ci­sion the­ory or 18.
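The additive-constant invariance is easy to verify mechanically. The toy MDP below is invented for illustration (two states, two actions, a fixed 5-step horizon): shifting every reward by a constant shifts every plan’s return by the same amount, so the argmax over plans cannot change.

```python
from itertools import product

def total_return(plan, c=0.0):
    """Return of an action sequence in a tiny deterministic MDP.

    State is 0 or 1; action 1 flips the state. All base rewards are
    negative ('losses'); c is an additive constant applied to every reward.
    """
    R = {(0, 0): -1.0, (0, 1): -0.1, (1, 0): -0.05, (1, 1): -0.5}
    s, G = 0, 0.0
    for a in plan:
        G += R[(s, a)] + c
        s ^= a  # action 1 toggles the state
    return G

plans = list(product([0, 1], repeat=5))
best_negative = max(plans, key=lambda p: total_return(p, c=0.0))
best_positive = max(plans, key=lambda p: total_return(p, c=10.0))  # all rewards now positive
assert best_negative == best_positive  # the optimal plan is unchanged
```

With a fixed horizon of T steps, the shift adds exactly c·T to every plan, which is why the ranking of plans is untouched; with discounting or variable episode lengths this is no longer guaranteed.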

, Tomasik 2014, de­bates the re­la­tion­ship of re­wards to con­sid­er­a­tions of “suffer­ing” or “pain”, given the du­al­ity be­tween costs-losses/rewards:

Per­haps the more ur­gent form of re­fine­ment than al­go­rithm se­lec­tion is to re­place pun­ish­ment with re­wards within a given al­go­rithm. RL sys­tems vary in whether they use pos­i­tive, neg­a­tive, or both types of re­wards:

  • In cer­tain RL prob­lems, such as maze-nav­i­ga­tion tasks dis­cussed in Sut­ton and Barto [1998], the re­wards are only pos­i­tive (if the agent reaches a goal) or zero (for non-goal states).
  • Some­times a mix be­tween pos­i­tive and neg­a­tive re­wards6 is used. For in­stance, Mc­Cal­lum [1993] put a sim­u­lated mouse in a maze, with a re­ward of 1 for reach­ing the goal, −1 for hit­ting a wall, and −0.1 for any other ac­tion.
  • In other sit­u­a­tions, the re­wards are al­ways neg­a­tive or ze­ro. For in­stance, in the cart-pole bal­anc­ing sys­tem of Barto et al. [1990], the agent re­ceives re­ward of 0 un­til the pole falls over, at which point the re­ward is −1. In Koppe­jan and White­son [2011]’s neu­roevo­lu­tion­ary RL ap­proach to he­li­copter con­trol, the RL agent is pun­ished ei­ther a lit­tle bit, with the neg­a­tive sum of squared de­vi­a­tions of the he­li­copter’s po­si­tions from its tar­get po­si­tions, or a lot if the he­li­copter crash­es.

Just as an­i­mal-wel­fare con­cerns may mo­ti­vate in­cor­po­ra­tion of re­wards rather than pun­ish­ments in train­ing dogs [Hiby et al., 2004] and horses [War­ren-Smith and Mc­Greevy, 2007, Innes and McBride, 2008], so too RL-a­gent wel­fare can mo­ti­vate more pos­i­tive forms of train­ing for ar­ti­fi­cial learn­ers. Pearce [2007] en­vi­sions a fu­ture in which agents are dri­ven by ‘gra­di­ents of well-be­ing’ (i.e., pos­i­tive ex­pe­ri­ences that are more or less in­tense) rather than by the dis­tinc­tion be­tween plea­sure ver­sus pain. How­ev­er, it’s not en­tirely clear where the moral bound­ary lies be­tween pos­i­tive ver­sus neg­a­tive wel­fare for sim­ple RL sys­tems. We might think that just the sign of the agen­t’s re­ward value r would dis­tin­guish the cas­es, but the sign alone may not be enough, as the fol­low­ing sec­tion ex­plains.

What’s the bound­ary be­tween pos­i­tive and neg­a­tive wel­fare?

Consider an RL agent with a fixed life of T time steps. At each time t, the agent receives a non-positive reward r_t ≤ 0 as a function of the action a_t that it takes, such as in the pole-balancing example. The agent chooses its action sequence (a_t), t = 1…T, with the goal of maximising the sum of future rewards:

$\sum_{t=1}^{T} r_t$

Now suppose we rewrite the rewards by adding a huge positive constant c to each of them, r′_t = r_t + c, big enough that all of the r′_t are positive. The agent now acts so as to optimise

$\sum_{t=1}^{T} r'_t = \sum_{t=1}^{T} (r_t + c) = cT + \sum_{t=1}^{T} r_t$

So the op­ti­mal ac­tion se­quence is the same in ei­ther case, since ad­di­tive con­stants don’t mat­ter to the agen­t’s be­hav­iour.7 But if be­hav­iour is iden­ti­cal, the only thing that changed was the sign and nu­mer­i­cal mag­ni­tude of the re­ward num­bers. Yet it seems ab­surd that the differ­ence be­tween hap­pi­ness and suffer­ing would de­pend on whether the num­bers used by the al­go­rithm hap­pened to have neg­a­tive signs in front. After all, in com­puter bi­na­ry, neg­a­tive num­bers have no mi­nus sign but are just an­other se­quence of 0s and 1s, and at the level of com­puter hard­ware, they look differ­ent still. More­over, if the agent was pre­vi­ously re­act­ing aver­sively to harm­ful stim­uli, it would con­tinue to do so. As Lenhart K. Schu­bert ex­plains:8 [This quo­ta­tion comes from spring 2014 lec­ture notes (ac­cessed March 2014) for a course called “Ma­chines and Con­scious­ness”.]

If the shift in ori­gin [to make neg­a­tive re­wards pos­i­tive] causes no be­hav­ioural change, then the ro­bot (anal­o­gous­ly, a per­son) would still be­have as if suffer­ing, yelling for help, etc., when in­jured or oth­er­wise in trou­ble, so it seems that the pain would not have been ban­ished after all!

So then what dis­tin­guishes plea­sure from pain?

…A more plau­si­ble ac­count is that the differ­ence re­lates to ‘avoid­ing’ ver­sus ‘seek­ing.’ A neg­a­tive ex­pe­ri­ence is one that the agent tries to get out of and do less of in the fu­ture. For in­stance, in­jury should be an in­her­ently neg­a­tive ex­pe­ri­ence, be­cause if re­pair­ing in­jury was re­ward­ing for an agent, the agent would seek to in­jure it­self so as to do re­pairs more often. If we tried to re­ward avoid­ance of in­jury, the agent would seek dan­ger­ous sit­u­a­tions so that it could en­joy re­turn­ing to safe­ty.10 [This ex­am­ple comes from Lenhart K. Schu­bert’s spring 2014 lec­ture notes (ac­cessed March 2014), for a course called ‘Ma­chines and Con­scious­ness.’ These thought ex­per­i­ments are not purely aca­d­e­m­ic. We can see an ex­am­ple of mal­adap­tive be­hav­iour re­sult­ing from an as­so­ci­a­tion of plea­sure with in­jury when peo­ple be­come ad­dicted to the en­dor­phin re­lease of self­-har­m.]19 In­jury needs to be some­thing the agent wants to get as far away from as pos­si­ble. So, for ex­am­ple, even if vom­it­ing due to food poi­son­ing is the best re­sponse you can take given your cur­rent sit­u­a­tion, the ex­pe­ri­ence should be neg­a­tive in or­der to dis­suade you from eat­ing spoiled foods again. Still, the dis­tinc­tion be­tween avoid­ing and seek­ing is­n’t al­ways clear. We ex­pe­ri­ence plea­sure due to seek­ing and con­sum­ing food but also pain that mo­ti­vates us to avoid hunger. Seek­ing one thing is often equiv­a­lent to avoid­ing an­oth­er. Like­wise with the pole-bal­anc­ing agent: Is it seek­ing a bal­anced pole, or avoid­ing a pole that falls over?

…Where does all of this leave our pole-bal­anc­ing agent? Does it suffer con­stant­ly, or is it en­joy­ing its efforts? Like­wise, is an RL agent that aims to ac­cu­mu­late pos­i­tive re­wards hav­ing fun, or is it suffer­ing when its re­ward is sub­op­ti­mal?
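Tomasik’s ‘avoiding versus seeking’ asymmetry can also be simulated. In the deliberately crude MDP below (states, actions, and reward numbers are all invented), rewarding the act of repair makes self-injury part of the optimal plan, while punishing the injured state makes the optimum avoid injury entirely:

```python
from itertools import product

ACTIONS = ("safe", "risky", "heal")

def rollout(plan, scheme):
    """Tiny MDP: 'risky' causes injury, 'heal' repairs it, 'safe' does nothing."""
    injured, G = False, 0.0
    for a in plan:
        if scheme == "reward_repair":
            G += 1.0 if (a == "heal" and injured) else 0.0  # repairing feels good
        else:  # scheme == "punish_injury"
            G += -1.0 if injured else 0.0                   # being injured hurts
        if a == "risky":
            injured = True
        elif a == "heal":
            injured = False
    return G

def best_plan(scheme, T=4):
    # Brute-force the optimal T-step plan under the given reward scheme.
    return max(product(ACTIONS, repeat=T), key=lambda p: rollout(p, scheme))

# Rewarding repair makes deliberate self-injury optimal (injure, heal, repeat)...
assert "risky" in best_plan("reward_repair")
# ...while punishing the injured state keeps the agent out of danger.
assert "risky" not in best_plan("punish_injury")
```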

Pain as Grounding

So with all that for back­ground, what is the pur­pose of pain?

The pur­pose of pain, I would say, is as a ground truth or outer loss. (This is a mo­ti­va­tional the­ory of pain with a more so­phis­ti­cated RL/psychiatric ground­ing.)

The pain reward/loss can­not be re­moved en­tirely for the rea­sons demon­strated by the diabetics/lepers/congenital in­sen­si­tives: the un­no­ticed in­juries and the poor plan­ning are ul­ti­mately fa­tal. With­out any pain qualia to make pain feel painful, we will do harm­ful things like run on a bro­ken leg or jump off a roof to im­press our friends20, or just move in a not-quite-right fash­ion and a few years later wind up para­plegics. (An in­trin­sic cu­rios­ity drive alone would in­ter­act badly with a to­tal ab­sence of painful pain: after all, what is more novel or harder to pre­dict than the strange and unique states which can be reached by self­-in­jury or reck­less­ness?)

If pain couldn’t be removed, could pain be turned into a reward, then? Could we be the equivalent of Morsella’s mind that doesn’t experience pain, as it infers plans and then executes them, experiencing only more or less reward? It would experience only positive rewards (pleasure) as it runs across burning-hot sands, as this is the optimal action for it to be taking according to whatever grand plan it has thought of.

Per­haps we could… but what stops Morsel­la’s mind from en­joy­ing re­wards by lit­er­ally run­ning in cir­cles on those sands un­til it dies or is crip­pled? Morsel­la’s mind may make a plan and de­fine a re­ward func­tion which avoids the need for any pain or neg­a­tive re­wards, but what hap­pens if there is any flaw in the com­puted plan or the re­ward es­ti­mates? Or if the plan is based on mis­taken premis­es? What if the sands are hot­ter than ex­pect­ed, or if the dis­tance is much fur­ther than ex­pect­ed, or if the fi­nal goal (per­haps an oa­sis of wa­ter) is not there? Such a mind raises se­ri­ous ques­tions about learn­ing and deal­ing with er­rors: what does such a mind ex­pe­ri­ence when a plan fails? Does it ex­pe­ri­ence noth­ing? Does it ex­pe­ri­ence a kind of “meta-pain”?

Con­sider what Brand (The Gift of Pain again, pg191–197) de­scribes as the ul­ti­mate cause of the fail­ure of years of re­search into cre­at­ing ‘pain pros­thet­ics’, com­put­er­ized gloves & socks that would mea­sure heat & pres­sure in re­al-time in or­der to warn those with­out pain like lep­ers or di­a­bet­ics: the pa­tients would just ig­nore the warn­ings, be­cause stop­ping to pre­vent fu­ture prob­lems was in­con­ve­nient while con­tin­u­ing paid off now. And when elec­tri­cal shock­ers were added to the sys­tem to stop them from do­ing a dan­ger­ous thing, Brand ob­served pa­tients sim­ply dis­abling it to do the dan­ger­ous thing & re-en­abling it after­wards!

What pain pro­vides is a con­stant, on­go­ing feed­back which an­chors all the es­ti­mates of fu­ture re­wards based on plan­ning or boot­strap­ping. It an­chors our in­tel­li­gence in a con­crete es­ti­ma­tion of bod­ily in­tegri­ty: the in­tact­ness of skin, the health of skin cells, the lack of dam­age to mus­cles, joints slid­ing and mov­ing as they ought to, and so on. If we are plan­ning well and act­ing effi­ciently in the world, we will, in the long run, on av­er­age, ex­pe­ri­ence higher lev­els of bod­ily in­tegrity and phys­i­cal health; if we are learn­ing and choos­ing and plan­ning poor­ly, then… we won’t. The bad­ness will grad­u­ally catch up with us and we may find our­selves blind scarred para­plegics miss­ing fin­gers and soon to die. A pain that was not painful would not serve this pur­pose, as it would merely be an­other kind of “tick­ling” sen­sa­tion. (Some might find it in­ter­est­ing or en­joy­able or it could ac­ci­den­tally be­come sex­u­al­ly-linked.) The per­cep­tions in ques­tion are sim­ply more or­di­nary tac­tile, kines­thet­ic, , or other stan­dard cat­e­gories of per­cep­tion; with­out painful pain, a fire burn­ing your hand sim­ply feels warm (be­fore the ther­mal-per­cep­tive nerves are de­stroyed and noth­ing fur­ther is felt), and a knife cut­ting flesh might feel like a rip­pling stretch­ing rub­bing move­ment.

We might say that a painful pain is a pain which forcibly in­serts it­self into the planning/optimization process, as a cost or lack of re­ward to be op­ti­mized. A pain which was not mo­ti­vat­ing is not what we mean by ‘pain’ at all.21 The mo­ti­va­tion it­self is the qualia of pain, much like an itch is an or­di­nary sen­sa­tion cou­pled with a mo­ti­va­tional urge to scratch. Any men­tal qual­ity or emo­tion or sen­sa­tion which is not ac­com­pa­nied by a de­mand­ing­ness, an in­vol­un­tary tak­ing-in­to-con­sid­er­a­tion, is not pain. The rest of our mind can force its way through pain, if it is suffi­ciently con­vinced that there is enough rea­son to in­cur the costs of pain be­cause the long-term re­ward is so great, and we do this all the time: we can con­vince our­selves to go to the gym, or with­stand the vac­ci­na­tion needle, or, in the ut­most ex­trem­i­ty, saw off a trapped hand to save our life. And if we are mis­tak­en, and the pre­dicted re­wards do not ar­rive, even­tu­ally the noisy con­stant feed­back of pain will over­ride the de­ci­sions lead­ing to pain, and what­ever in­cor­rect be­liefs or mod­els led to the in­cor­rect de­ci­sions will be ad­justed to do bet­ter in the fu­ture.

But the pain can­not and must not be over­rid­den: hu­man or­gan­isms can’t be trusted to sim­ply ‘turn off’ pain and in­dulge an idle cu­rios­ity about cut­ting off hands. We are in­suffi­ciently in­tel­li­gent, our pri­ors in­suffi­ciently strong, our rea­son­ing and plan­ning too poor, and we must do too much learn­ing within each life to do with­out pain.
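Brand’s disconnect-the-wire failure mode (described below in the pain-prosthetics excerpts) can be caricatured in a few lines of code. In this toy model (the two-action setup, names, and all numbers are invented for illustration, not taken from Brand), pain is just a penalty term in an agent’s utility; if the penalty’s weight is a setting the agent itself may adjust, the ‘optimal plan’ is always to zero it out and do the damaging thing anyway:

```python
# Toy sketch (illustrative only): an agent picks the action maximizing
# immediate payoff minus felt pain (pain_weight * damage).
ACTIONS = {"safe": (1.0, 0.0), "damaging": (3.0, 5.0)}  # name: (payoff, damage)

def best_action(pain_weight):
    return max(ACTIONS, key=lambda a: ACTIONS[a][0] - pain_weight * ACTIONS[a][1])

def best_action_if_pain_is_optional():
    # If pain_weight is itself under the agent's control, the agent
    # 'disconnects the wire': zero weight always maximizes raw payoff.
    return best_action(pain_weight=0.0)
```

With the penalty hard-wired (`best_action(1.0)`), the safe action wins, since 1.0 > 3.0 - 5.0; given control of the weight, the agent takes the damaging action every time.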

A sim­i­lar ar­gu­ment might ap­ply to the puz­zle of ‘willpower’, ‘pro­cras­ti­na­tion’. Why do we have such prob­lems, par­tic­u­larly in a mod­ern con­text, do­ing aught we know we should and do­ing naught we ought­n’t?

On the grave of the ‘blood glucose’ level theory, Kurzban et al 2013 (see the later discussion) erects an opportunity-cost theory of willpower. Since objective physical measurements like blood-glucose levels fail to mechanically explain poorer brain functionality or why strenuous activities like sports are ‘restful’ & reduce ‘burnout’ (similar to the failure of objective physical measurements like lactate levels to explain why people are able to physically exercise only a certain amount, despite being able to exercise far more if properly motivated or if tricked), the reason for willpower running out must be subjective.

To explain the sugar-related observations, Kurzban et al 2013 suggest that the aversiveness of long focus and cognitive effort is a simple heuristic which creates a baseline cost to focusing for ‘too long’ on any one task, to the potential neglect of other opportunities, with the sugar interventions (such as merely tasting sugar water) which appear to boost willpower actually serving as proximate reward signals (signals, because the actual energetic content is nil, and cognitive effort doesn’t meaningfully burn calories in the first place), which justify to the underlying heuristic that further effort on the same task is worthwhile and the opportunity cost is minimal.

The lack of willpower is a heuristic which doesn’t require the brain to explicitly track & prioritize & schedule all possible tasks, by forcing it to regularly halt tasks—“like a timer that says, ‘Okay you’re done now.’” If one could override fatigue at will, the consequences can be bad. Users of dopaminergic drugs like amphetamines often note issues with channeling the reduced fatigue into useful tasks rather than alphabetizing one’s bookcase. In more extreme cases, if one could ignore fatigue entirely, then, analogous to lack of pain, the consequences could be severe or fatal: one ultra-endurance cyclist would cycle for thousands of kilometers, ignoring such problems as elaborate hallucinations, and was eventually killed while cycling. The ‘timer’ is implemented, among other things, as a gradual buildup of adenosine, which creates mental fatigue and possibly physical fatigue during exercise (Martin et al 2018), leading to a gradually increasing subjectively-perceived ‘cost’ of continuing with a task/staying awake/continuing athletic activities, which resets when one stops/sleeps/rests. Since the human mind is too limited in its planning and monitoring ability, it cannot be allowed to ‘turn off’ opportunity-cost warnings and engage in hyperfocus on potentially useless things at the neglect of all other things; procrastination here represents a psychic version of pain.

From this perspective, it is not surprising that so many stimulants are adenosinergic or dopaminergic22, or that many anti-procrastination strategies like the Procrastination Equation boil down to optimizing for more rewards or more frequent rewards (eg breaking tasks down into many smaller tasks, which can be completed individually & receive smaller but more frequent rewards, or thinking more clearly about whether something is worth doing): all of these would affect the reward perception itself, and reduce the baseline opportunity-cost ‘pain’. This perspective may also shed light on burnout, on why restorative hobbies are ideally maximally different from jobs, and on more miscellaneous observations like the lower rate of ‘hobbies’ outside the West: burnout may be a long-term homeostatic reaction to spending ‘too much’ time too frequently on a difficult, not-immediately-rewarding task despite earlier attempts to pursue other opportunities, which were always overridden, ultimately resulting in a total collapse; hobbies ought to be as different in location and physical activity and social structure as possible (eg a solitary programmer indoors should pursue a social physical activity outdoors) to ensure that they feel completely different for the mind than the regular occupation; and in places with less job specialization or fewer work-hours, the regular flow of a variety of tasks and opportunities means that no such special activity as a ‘hobby’ is necessary.
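The Procrastination Equation mentioned above has a standard quantitative form, Motivation = (Expectancy × Value) / (1 + Impulsiveness × Delay), though the exact denominator varies across presentations of temporal motivation theory, and the numbers below are invented for illustration. It makes the task-splitting advice concrete: a small subtask cuts Value, but cuts Delay far more, so motivation rises.

```python
# One common form of the 'Procrastination Equation' (temporal motivation
# theory); exact functional forms vary by author.
def motivation(expectancy, value, impulsiveness, delay):
    return (expectancy * value) / (1.0 + impulsiveness * delay)

# Illustrative numbers: one big payoff 30 days away vs. a subtask worth a
# tenth as much that pays off tomorrow; the subtask is more motivating.
whole_task = motivation(expectancy=0.9, value=10.0, impulsiveness=1.0, delay=30.0)
subtask = motivation(expectancy=0.9, value=1.0, impulsiveness=1.0, delay=1.0)
```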

Perhaps if we were superintelligent AIs who could trivially plan flawless humanoid locomotion at 1000Hz taking into account all possible damages, or if we were emulated brains sculpted by endless evolutionary procedures to execute perfectly adaptive plans by pure instinct, or if we were simple amoebae in a Petri dish who had no real choices to make, there would be no need for a pain which was painful. And likewise, were we endlessly planning and replanning to the end of days, we should never experience akrasia; we should merely do what is necessary (perhaps not even experiencing any qualia of effort or deliberation). But we are not. The pain keeps us honest. In the end, pain is our only teacher.

The Perpetual Peace

“These laws, taken in the largest sense, be­ing Growth with Re­pro­duc­tion; In­her­i­tance which is al­most im­plied by re­pro­duc­tion; Vari­abil­ity from the in­di­rect and di­rect ac­tion of the ex­ter­nal con­di­tions of life, and from use and dis­use; a Ra­tio of In­crease so high as to lead to a Strug­gle for Life, and as a con­se­quence to Nat­ural Se­lec­tion, en­tail­ing Di­ver­gence of Char­ac­ter and the Ex­tinc­tion of less-im­proved forms. Thus, from the war of na­ture, from famine and death, the most ex­alted ob­ject which we are ca­pa­ble of con­ceiv­ing, name­ly, the pro­duc­tion of the higher an­i­mals, di­rectly fol­lows. There is grandeur in this view of life, with its sev­eral pow­ers, hav­ing been orig­i­nally breathed into a few forms or into one; and that, whilst this planet has gone cy­cling on ac­cord­ing to the fixed law of grav­i­ty, from so sim­ple a be­gin­ning end­less forms most beau­ti­ful and most won­der­ful have been, and are be­ing, evolved.”

Charles Dar­win, On the Ori­gin of Species

“In war, there is the free pos­si­bil­ity that not only in­di­vid­ual de­ter­mi­na­cies, but the sum to­tal of the­se, will be de­stroyed as life, whether for the ab­solute it­self or for the peo­ple. Thus, war pre­serves the eth­i­cal health of peo­ples in their in­differ­ence to de­ter­mi­nate things [Bes­timmtheiten]; it pre­vents the lat­ter from hard­en­ing, and the peo­ple from be­com­ing ha­bit­u­ated to them, just as the move­ment of the winds pre­serves the seas from that stag­na­tion which a per­ma­nent calm would pro­duce, and which a per­ma­nent (or in­deed ‘per­pet­ual’) peace would pro­duce among peo­ples.”

G.W.F. Hegel23

“We must rec­og­nize that war is com­mon, strife is jus­tice, and all things hap­pen ac­cord­ing to strife and ne­ces­si­ty…War is fa­ther of all and king of all”

Her­a­cli­tus, B80/B53

“It is not enough to suc­ceed; oth­ers must fail.”


What if we re­move the outer loss?

In a meta-learning context, it will then either overfit to a single instance of a problem, or learn a potentially arbitrarily suboptimal average response; in the Quake CTF, the inner loss might converge, as mentioned, to every-agent-for-itself or greedy tactical victories guaranteeing strategic losses; in a human, the result would (at present, due to refusal to use artificial selection or genetic engineering) be a gradual buildup of mutation load leading to serious health issues and eventually perhaps a mutational meltdown/error catastrophe; and in an economy, it leads to… the USSR.
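The two-level structure can be sketched as a toy optimization (purely illustrative; the losses, learning rate, and veto rule are invented here, not any published algorithm): a fast inner loop does gradient descent on a misspecified surrogate loss, while a slow outer loop, standing in for death or bankruptcy, can only veto proposals that make the ground truth worse. Without the veto the system drifts all the way to the surrogate’s optimum; with it, it is halted near the true one.

```python
# Toy two-level optimization: the inner loop follows a surrogate loss whose
# optimum (x = -2) differs from the ground-truth optimum (x = 0).
def ground_truth_loss(x):  # 'outer' loss: death/bankruptcy-level feedback
    return x * x

def inner_step(x, lr=0.1):
    # gradient step on the surrogate loss (x + 2)^2, which prefers x = -2
    return x - lr * 2.0 * (x + 2.0)

def optimize(outer_every=None, steps=200):
    x = 5.0
    for t in range(1, steps + 1):
        proposal = inner_step(x)
        if outer_every and t % outer_every == 0:
            # ground-truth backstop: veto proposals that worsen reality
            if ground_truth_loss(proposal) > ground_truth_loss(x):
                continue
        x = proposal
    return x

without_backstop = optimize(outer_every=None)  # drifts to surrogate optimum, x ~ -2
with_backstop = optimize(outer_every=1)        # stuck near the true optimum, |x| < 0.2
```

The backstop never proposes anything itself; like bankruptcy, it is slow, crude, and purely negative, yet it suffices to keep the clever inner optimizer from pursuing its misspecified goal to the bitter end.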

The amount of this constraint can vary, based on the greater power of the non-ground-truth optimization and the fidelity of replication and accuracy of selection. The Price equation gives us quantitative insight into the conditions under which such selection could work at all: if a NN could only copy itself in a crude and lossy way, meta-learning would not work well in the first place (properties must be preserved from one generation to the next); if a human cell copied itself with an error rate of as much as 1 in millions, humans could never exist because reproductive fitness is too weak a reward to purge the escalating mutation load (selective gain is negative); if bankruptcy becomes more arbitrary and has less to do with consumer demand than acts of god/government, then corporations will become more pathologically inefficient (covariance between traits & fitness too small to accumulate in meaningful ways).
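For reference, the Price equation, which formalizes exactly these conditions, can be written as:

$$\bar{w}\,\Delta\bar{z} = \operatorname{Cov}(w_i, z_i) + \operatorname{E}\left(w_i\,\Delta z_i\right)$$

where $z_i$ is an entity’s trait value, $w_i$ its fitness, and $\bar{z}, \bar{w}$ the population means: the change in the average trait across a generation decomposes into a selection term (the covariance between trait and fitness, which shrinks toward zero when bankruptcy is arbitrary) and a transmission term (the fitness-weighted change during copying, which swamps selection when replication is lossy or mutation rates are high).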

As Shal­izi con­cludes in his re­view:

Plan­ning is cer­tainly pos­si­ble within lim­ited do­main­s—at least if we can get good data to the plan­ner­s—and those lim­its will ex­pand as com­put­ing power grows. But plan­ning is only pos­si­ble within those do­mains be­cause mak­ing money gives firms (or fir­m-like en­ti­ties) an ob­jec­tive func­tion which is both un­am­bigu­ous and blink­ered. Plan­ning for the whole econ­omy would, un­der the most fa­vor­able pos­si­ble as­sump­tions, be in­tractable for the fore­see­able fu­ture, and de­cid­ing on a plan runs into diffi­cul­ties we have no idea how to solve. The sort of effi­cient planned econ­omy dreamed of by the char­ac­ters in Red Plenty is some­thing we have no clue of how to bring about, even if we were will­ing to ac­cept dic­ta­tor­ship to do so.

This is why the plan­ning al­go­rithms can­not sim­ply keep grow­ing and take over all mar­kets: “who watches the watch­men?” As pow­er­ful as the var­i­ous in­ter­nal or­ga­ni­za­tional and plan­ning al­go­rithms are, and much su­pe­rior to evolution/market com­pe­ti­tion, they only op­ti­mize sur­ro­gate in­ner loss­es, which are not the end-goal, and they must be con­strained by a ground-truth loss. The re­liance on this loss can and should be re­duced, but a re­duc­tion to zero is un­de­sir­able as long as the in­ner losses con­verge to any op­tima differ­ent from the ground-truth op­ti­ma.

Given the often long lifes­pan of a fail­ing cor­po­ra­tion, the diffi­culty cor­po­ra­tions en­counter in align­ing em­ploy­ees with their goals, and the in­abil­ity to re­pro­duce their ‘cul­ture’, it is no won­der that group se­lec­tion in mar­kets is fee­ble at best, and the outer loss can­not be re­moved. On the other hand, these fail­ings are not nec­es­sar­ily per­ma­nent: as cor­po­ra­tions grad­u­ally turn into soft­ware, which can be copied and ex­ist in much more dy­namic mar­kets with faster OODA loops, per­haps we can ex­pect a tran­si­tion to an era where cor­po­ra­tions do repli­cate pre­cisely & can then start to con­sis­tently evolve large in­creases in effi­cien­cy, rapidly ex­ceed­ing all progress to date.

See Also


Meta-Learning Paradigms

From a published survey: “Table 1. A comparison of published meta-learning approaches.”

Pain prosthetics

Brand & Yancey’s 1993 Pain: The Gift No One Wants, pg191–197, re­counts Brand’s re­search in the 1960s–1970s in at­tempt­ing to cre­ate ‘ar­ti­fi­cial pain’ or ‘pain pros­thet­ics’, which ul­ti­mately failed be­cause hu­man per­cep­tion of pain is mar­velously ac­cu­rate & su­pe­rior to the crude elec­tron­ics of the day, but more fun­da­men­tally be­cause they dis­cov­ered the aver­sive­ness of pain was crit­i­cal to ac­com­plish­ing the goal of dis­cour­ag­ing repet­i­tive or severe­ly-dam­ag­ing be­hav­ior, as the test sub­jects would sim­ply ig­nore or dis­able the de­vices to get on with what­ever they were do­ing. Ex­cerpts:

My grant ap­pli­ca­tion bore the ti­tle “A Prac­ti­cal Sub­sti­tute for Pain.” We pro­posed de­vel­op­ing an ar­ti­fi­cial pain sys­tem to re­place the de­fec­tive sys­tem in peo­ple who suffered from lep­rosy, con­gen­i­tal pain­less­ness, di­a­betic neu­ropa­thy, and other nerve dis­or­ders. Our pro­posal stressed the po­ten­tial eco­nomic ben­e­fits: by in­vest­ing a mil­lion dol­lars to find a way to alert such pa­tients to the worst dan­gers, the gov­ern­ment might save many mil­lions in clin­i­cal treat­ment, am­pu­ta­tions, and re­ha­bil­i­ta­tion.

The pro­posal caused a stir at the Na­tional In­sti­tutes of Health in Wash­ing­ton. They had re­ceived ap­pli­ca­tions from sci­en­tists who wanted to di­min­ish or abol­ish pain, but never from one who wished to cre­ate pain. Nev­er­the­less, we re­ceived fund­ing for the pro­ject.

We planned, in effect, to du­pli­cate the hu­man ner­vous sys­tem on a very small scale. We would need a sub­sti­tute “nerve sen­sor” to gen­er­ate sig­nals at the ex­trem­i­ty, a “nerve axon” or wiring sys­tem to con­vey the warn­ing mes­sage, and a re­sponse de­vice to in­form the brain of the dan­ger. Ex­cite­ment grew in the Carville re­search lab­o­ra­to­ry. We were at­tempt­ing some­thing that, to our knowl­edge, had never been tried.

I sub­con­tracted with the elec­tri­cal en­gi­neer­ing de­part­ment at Louisiana State Uni­ver­sity to de­velop a minia­ture sen­sor for mea­sur­ing tem­per­a­ture and pres­sure. One of the en­gi­neers there joked about the po­ten­tial for profit: “If our idea works, we’ll have a pain sys­tem that warns of dan­ger but does­n’t hurt. In other words, we’ll have the good parts of pain with­out the bad! Healthy peo­ple will de­mand these gad­gets for them­selves in place of their own pain sys­tems. Who would­n’t pre­fer a warn­ing sig­nal through a hear­ing aid over real pain in a fin­ger?”

The LSU en­gi­neers soon showed us pro­to­type trans­duc­ers, slim metal disks smaller than a shirt but­ton. Suffi­cient pres­sure on these trans­duc­ers would al­ter their elec­tri­cal re­sis­tance, trig­ger­ing an elec­tri­cal cur­rent. They asked our re­search team to de­ter­mine what thresh­olds of pres­sure should be pro­grammed into the minia­ture sen­sors. I re­played my uni­ver­sity days in Tommy Lewis’s pain lab­o­ra­to­ry, with one big differ­ence: now, in­stead of merely test­ing the in­-built prop­er­ties of a well-de­signed hu­man body, I had to think like the de­sign­er. What dan­gers would that body face? How could I quan­tify those dan­gers in a way the sen­sors could mea­sure?

To sim­plify mat­ters, we fo­cused on fin­ger­tips and the soles of feet, the two ar­eas that caused our pa­tients the most prob­lems. But how could we get a me­chan­i­cal sen­sor to dis­tin­guish be­tween the ac­cept­able pres­sure of, say, grip­ping a fork and the un­ac­cept­able pres­sure of grip­ping a piece of bro­ken glass? How could we cal­i­brate the stress level of or­di­nary walk­ing and yet al­low for the oc­ca­sional ex­tra stress of step­ping off a curb or jump­ing over a pud­dle? Our pro­ject, which we had be­gun with such en­thu­si­asm, seemed more and more daunt­ing.

I re­mem­bered from stu­dent days that nerve cells change their per­cep­tion of pain in ac­cor­dance with the body’s needs. We say a fin­ger feels ten­der: thou­sands of nerve cells in the dam­aged tis­sue au­to­mat­i­cally lower their thresh­old of pain to dis­cour­age us from us­ing the fin­ger. An in­fected fin­ger seems as if it is al­ways get­ting bumped—it “sticks out like a sore thumb”—be­cause in­flam­ma­tion has made it ten times more sen­si­tive to pain. No me­chan­i­cal trans­ducer could be so re­spon­sive to the needs of liv­ing tis­sue.

Every month the op­ti­mism level of the re­searchers went down a notch. Our Carville team, who had made the sig­nifi­cant find­ings about repet­i­tive stress and con­stant stress, knew that the worst dan­gers came not from ab­nor­mal stress­es, but from very nor­mal stresses re­peated thou­sands of times, as in the act of walk­ing. And Sher­man the pig24 had demon­strated that a con­stant pres­sure as low as one pound per square inch could cause skin dam­age. How could we pos­si­bly pro­gram all these vari­ables into a minia­ture trans­duc­er? We would need a com­puter chip on every sen­sor just to keep track of chang­ing vul­ner­a­bil­ity of tis­sues to dam­age from repet­i­tive stress. We gained a new re­spect for the hu­man body’s ca­pac­ity to sort through such diffi­cult op­tions in­stan­ta­neous­ly.

After many compromises we settled on baseline pressures and temperatures to activate the sensors, and then designed a glove and a sock to incorporate several transducers. At last we could test our substitute pain system on actual patients. Now we ran into mechanical problems. The sensors, state-of-the-art electronic miniatures, tended to deteriorate from metal fatigue or corrosion after a few hundred uses. Short-circuits made them fire off false alarms, which aggravated our volunteer patients. Worse, the sensors cost about $450 each (in 1970 dollars; ~$2,060 today), and a leprosy patient who took a long walk around the hospital grounds could wear out a $2,000 (~$9,156 today) sock!

On av­er­age, a set of trans­duc­ers held up to nor­mal wear-and-tear for one or two weeks. We cer­tainly could not afford to let a pa­tient wear one of our ex­pen­sive gloves for a task like rak­ing leaves or pound­ing a ham­mer—the very ac­tiv­i­ties we were try­ing to make safe. Be­fore long the pa­tients were wor­ry­ing more about pro­tect­ing our trans­duc­ers, their sup­posed pro­tec­tors, than about pro­tect­ing them­selves.

Even when the trans­duc­ers worked cor­rect­ly, the en­tire sys­tem was con­tin­gent on the free will of the pa­tients. We had grandly talked of re­tain­ing “the good parts of pain with­out the bad,” which meant de­sign­ing a warn­ing sys­tem that would not hurt. First we tried a de­vice like a hear­ing aid that would hum when the sen­sors were re­ceiv­ing nor­mal pres­sures, buzz when they were in slight dan­ger, and emit a pierc­ing sound when they per­ceived an ac­tual dan­ger. But when a pa­tient with a dam­aged hand turned a screw­driver too hard, and the loud warn­ing sig­nal went off, he would sim­ply over­ride it—This glove is al­ways send­ing out false sig­nals—and turn the screw­driver any­way. Blink­ing lights failed for the same rea­son.

Pa­tients who per­ceived “pain” only in the ab­stract could not be per­suaded to trust the ar­ti­fi­cial sen­sors. Or they be­came bored with the sig­nals and ig­nored them. The sober­ing re­al­iza­tion dawned on us that un­less we built in a qual­ity of com­pul­sion, our sub­sti­tute sys­tem would never work. Be­ing alerted to the dan­ger was not enough; our pa­tients had to be forced to re­spond. Pro­fes­sor Tims of LSU said to me, al­most in de­spair, “Paul, it’s no use. We’ll never be able to pro­tect these limbs un­less the sig­nal re­ally hurts. Surely there must be some way to hurt your pa­tients enough to make them pay at­ten­tion.”

We tried every al­ter­na­tive be­fore re­sort­ing to pain, and fi­nally con­cluded Tims was right: the stim­u­lus had to be un­pleas­ant, just as pain is un­pleas­ant. One of Tim­s’s grad­u­ate stu­dents de­vel­oped a small bat­tery-op­er­ated coil that, when ac­ti­vat­ed, sent out an elec­tric shock at high volt­age but low cur­rent. It was harm­less but painful, at least when ap­plied to parts of the body that could feel pain.

Lep­rosy bacil­li, fa­vor­ing the cooler parts of the body, usu­ally left warm re­gions such as the armpit undis­turbed, and so we be­gan tap­ing the elec­tric coil to pa­tients’ armpits for our tests. Some vol­un­teers dropped out of the pro­gram, but a few brave ones stayed on. I no­ticed, though, that they viewed pain from our ar­ti­fi­cial sen­sors in a differ­ent way than pain from nat­ural sources. They tended to see the elec­tric shocks as pun­ish­ment for break­ing rules, not as mes­sages from an en­dan­gered body part. They re­sponded with re­sent­ment, not an in­stinct of self­-p­reser­va­tion, be­cause our ar­ti­fi­cial sys­tem had no in­nate link to their sense of self. How could it, when they felt a jolt in the armpit for some­thing hap­pen­ing to the hand?

I learned a fun­da­men­tal dis­tinc­tion: a per­son who never feels pain is task-ori­ent­ed, whereas a per­son who has an in­tact pain sys­tem is self­-ori­ent­ed. The pain­less per­son may know by a sig­nal that a cer­tain ac­tion is harm­ful, but if he re­ally wants to, he does it any­way. The pain-sen­si­tive per­son, no mat­ter how much he wants to do some­thing, will stop for pain, be­cause deep in his psy­che he knows that pre­serv­ing his own self is more sig­nifi­cant than any­thing he might want to do.

Our project went through many stages, con­sum­ing five years of lab­o­ra­tory re­search, thou­sands of man-hours, and more than a mil­lion dol­lars of gov­ern­ment funds. In the end we had to aban­don the en­tire scheme. A warn­ing sys­tem suit­able for just one hand was ex­or­bi­tantly ex­pen­sive, sub­ject to fre­quent me­chan­i­cal break­down, and hope­lessly in­ad­e­quate to in­ter­pret the pro­fu­sion of sen­sa­tions that con­sti­tute touch and pain. Most im­por­tant, we found no way around the fun­da­men­tal weak­ness in our sys­tem: it re­mained un­der the pa­tien­t’s con­trol. If the pa­tient did not want to heed the warn­ings from our sen­sors, he could al­ways find a way to by­pass the whole sys­tem.

Look­ing back, I can point to a sin­gle in­stant when I knew for cer­tain that the sub­sti­tute pain project would not suc­ceed. I was look­ing for a tool in the man­ual arts work­shop when Charles, one of our vol­un­teer pa­tients, came in to re­place a gas­ket on a mo­tor­cy­cle en­gine. He wheeled the bike across the con­crete floor, kicked down the kick­stand, and set to work on the gaso­line en­gine. I watched him out of the cor­ner of my eye. Charles was one of our most con­sci­en­tious vol­un­teers, and I was ea­ger to see how the ar­ti­fi­cial pain sen­sors on his glove would per­form.

One of the en­gine bolts had ap­par­ently rust­ed, and Charles made sev­eral at­tempts to loosen it with a wrench. It did not give. I saw him put some force be­hind the wrench, and then stop abrupt­ly, jerk­ing back­ward. The elec­tric coil must have jolted him. (I could never avoid winc­ing when I saw our man-made pain sys­tem func­tion as it was de­signed to do.) Charles stud­ied the sit­u­a­tion for a mo­ment, then reached up un­der his armpit and dis­con­nected a wire. He forced the bolt loose with a big wrench, put his hand in his shirt again, and re­con­nected the wire. It was then that I knew we had failed. Any sys­tem that al­lowed our pa­tients free­dom of choice was doomed.

I never ful­filled my dream of “a prac­ti­cal sub­sti­tute for pain,” but the process did at last set to rest the two ques­tions that had long haunted me. Why must pain be un­pleas­ant? Why must pain per­sist? Our sys­tem failed for the pre­cise rea­son that we could not effec­tively re­pro­duce those two qual­i­ties of pain. The mys­te­ri­ous power of the hu­man brain can force a per­son to STOP!—something I could never ac­com­plish with my sub­sti­tute sys­tem. And “nat­ural” pain will per­sist as long as dan­ger threat­ens, whether we want it to or not; un­like my sub­sti­tute sys­tem, it can­not be switched off.

As I worked on the sub­sti­tute sys­tem, I some­times thought of my rheuma­toid arthri­tis pa­tients, who yearned for just the sort of on-off switch we were in­stalling. If rheuma­toid pa­tients had a switch or a wire they could dis­con­nect, most would de­stroy their hands in days or weeks. How for­tu­nate, I thought, that for most of us the pain switch will al­ways re­main out of reach.

  1. See also SSC & Chris Said’s re­views.↩︎

  2. Amus­ing­ly, the front of Red Plenty notes a grant from Tar­get to the pub­lisher. ↩︎

  3. More Si­mon 1991:

    Over a span of years, a large frac­tion of all eco­nomic ac­tiv­ity has been gath­ered within the walls of large and steadily grow­ing or­ga­ni­za­tions. The green ar­eas ob­served by our Mar­t­ian have grown steadi­ly. Ijiri and I have sug­gested that the growth of or­ga­ni­za­tions may have only a lit­tle to do with effi­ciency (e­spe­cially since, in most large-s­cale en­ter­pris­es, economies and dis­ec­onomies of scale are quite smal­l), but may be pro­duced mainly by sim­ple sto­chas­tic growth mech­a­nisms (I­jiri and Si­mon, 1977).

    But if par­tic­u­lar co­or­di­na­tion mech­a­nisms do not de­ter­mine ex­actly where the bound­aries be­tween or­ga­ni­za­tions and mar­kets will lie, the ex­is­tence and effec­tive­ness of large or­ga­ni­za­tions does de­pend on some ad­e­quate set of pow­er­ful co­or­di­nat­ing mech­a­nisms be­ing avail­able. These means of co­or­di­na­tion in or­ga­ni­za­tions, taken in com­bi­na­tion with the mo­ti­va­tional mech­a­nisms dis­cussed ear­lier, cre­ate pos­si­bil­i­ties for en­hanc­ing pro­duc­tiv­ity and effi­ciency through the di­vi­sion of la­bor and spe­cial­iza­tion.

    In gen­er­al, as spe­cial­iza­tion of tasks pro­ceeds, the in­ter­de­pen­dency of the spe­cial­ized parts in­creas­es. Hence a struc­ture with effec­tive mech­a­nisms for co­or­di­na­tion can carry spe­cial­iza­tion fur­ther than a struc­ture lack­ing these mech­a­nisms. It has some­times been ar­gued that spe­cial­iza­tion of work in mod­ern in­dus­try pro­ceeded quite in­de­pen­dently of the rise of the fac­tory sys­tem. This may have been true of the early phases of the in­dus­trial rev­o­lu­tion, but would be hard to sus­tain in re­la­tion to con­tem­po­rary fac­to­ries. With the com­bi­na­tion of au­thor­ity re­la­tions, their mo­ti­va­tional foun­da­tions, a reper­tory of co­or­di­na­tive mech­a­nisms, and the di­vi­sion of labor, we ar­rive at the large hi­er­ar­chi­cal or­ga­ni­za­tions that are so char­ac­ter­is­tic of mod­ern life.

  4. In RL terms, evolution, like evolution strategies, is a kind of Monte Carlo method. Monte Carlo methods require no knowledge or model of the environment, benefit from low bias, can handle even long-term consequences with ease, do not diverge or fail or become biased like approaches using bootstrapping (especially in the case of the “deadly triad”), and are decentralized/embarrassingly parallel. A major downside, of course, is that they accomplish all this by being extremely high-variance/sample-inefficient (eg Salimans et al 2017 is ~10x worse than competing DRL methods).↩︎
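This Monte Carlo flavor can be made concrete with a minimal evolution-strategies sketch in the spirit of Salimans et al 2017 (the hyperparameters, the 1-parameter ‘policy’, and the toy fitness function are all invented for illustration; this is not their implementation): perturb the parameters with noise, evaluate the black-box return of each perturbation (no environment model, no bootstrapping, embarrassingly parallel), and move along the return-weighted noise.

```python
import random

def fitness(theta):  # black-box 'return'; the unknown optimum is theta = 3
    return -(theta - 3.0) ** 2

def es_optimize(steps=300, pop=50, sigma=0.1, lr=0.02, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        noise = [rng.gauss(0.0, 1.0) for _ in range(pop)]
        returns = [fitness(theta + sigma * e) for e in noise]
        mean = sum(returns) / pop
        std = (sum((r - mean) ** 2 for r in returns) / pop) ** 0.5 or 1.0
        # gradient estimate from return-weighted noise (z-scoring the returns
        # reduces variance, a stand-in for the rank-shaping used in practice)
        grad = sum((r - mean) / std * e for r, e in zip(returns, noise)) / (pop * sigma)
        theta += lr * grad
    return theta
```

Despite never seeing a gradient of `fitness`, the estimate lands near the optimum at the cost of steps × pop return evaluations for a single scalar parameter: high-variance and sample-hungry, but needing nothing except the ability to run the ‘organism’ and score it.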

  5. And note the irony of the widely-cited corn & manioc examples of how farming encodes subtle wisdom due to group selection: in both cases, the groups that developed it in the Americas were, despite their superior local food processing, highly ‘unfit’ and suffered enormous population declines due to pandemic & conquest! You might object that those were exogenous factors, bad luck, due to things unrelated to their food processing… which is precisely the problem when selecting on groups.↩︎

  6. An example of the failure of traditional medicine is provided by the NCI anti-cancer plant screening program, run by an enthusiast for medical folklore & ethnobotany who specifically targeted plants based on “a massive literature search, including ancient Chinese, Egyptian, Greek, and Roman texts”. The screening program screened “some 12,000 to 13,000 species…over 114,000 extracts were tested for antitumor activity” (rates rising steeply afterwards), which yielded 3 drugs ever (paclitaxel/Taxol/PTX and two others), only one of which was all that important (Taxol). So, in a period with few useful anti-cancer drugs to compete against, large-scale screening of all the low-hanging fruit, targeting plants prized by traditional medical practices from throughout history & across the globe, had a success rate somewhere on the order of 0.007%.

    A recent example is the anti-malarial drug artemisinin, which earned its discoverer, Tu Youyou, a 2015 Nobel; she worked in a lab dedicated to traditional herbal medicine (Mao Zedong encouraged the construction of a ‘traditional Chinese medicine’ as a way to reduce medical expenses and conserve foreign currency). She discovered it in 1972, after screening several thousand traditional Chinese remedies. Artemisinin is important, and one might ask what else her lab discovered in the treasure trove of traditional Chinese medicine in the intervening 43 years; the answer, apparently, is ‘nothing’.

    While Taxol and artemisinin may justify plant screening on a pure cost-benefit basis (such a hit rate does not appear much worse than other methods’, although one should note that the profit-hungry pharmaceutical industry does not prioritize or invest much in such screening), the more important lesson here is about the accuracy of ‘traditional medicine’. Traditional medicine affords an excellent test case for ‘the wisdom of tradition’: medicine has hard endpoints as it is literally a matter of life and death, is an issue during every individual’s life at the individual level (rather than occasionally at the group level), effects can be extremely large (bordering on ‘silver bullet’ level), and tens of thousands or hundreds of thousands of years have passed for accumulation & selection. Given all of these favorable factors, can the wisdom of tradition still overcome the serious statistical difficulties and cognitive biases leading to false beliefs? Well, the best success stories of traditional medicine have accuracy rates like… <1%. So much for the ‘wisdom of tradition’. The fact that some working drugs happen to also have been mentioned, sometimes, in some traditions, in some ways, along with hundreds of thousands of useless or harmful drugs which look just the same, is hardly any more testimonial to folk medicine as a source of truth than the observation that Heinrich Schliemann discovered a city sort of like Troy justifies treating the Iliad or Odyssey as accurate historical textbooks rather than 99%-fictional literature. (Likewise other examples, such as Australian Aboriginal myths preserving some traces of ancient geological events: they certainly do not show that the oral histories are reliable histories or that we should just take them as fact.)↩︎

  7. The French Revolution: A History, by Thomas Carlyle.↩︎

  8. Brand also notes of a leprosy patient whose nerves had been deadened by it:

    As I watched, this man tucked his crutches under his arm and began to run on both feet with a very lopsided gait….He ended up near the head of the line, where he stood panting, leaning on his crutches, wearing a smile of triumph…By running on an already dislocated ankle, he had put far too much force on the end of his leg bone and the skin had broken under the stress…I knelt beside him and found that small stones and twigs had jammed through the end of the bone into the marrow cavity. I had no choice but to amputate the leg below the knee.

    These two scenes have long haunted me.

  9. An example quote from Brand & Yancey’s 1993 Pain: The Gift No One Wants about congenital pain insensitivity:

    When I unwrapped the last bandage, I found grossly infected ulcers on the soles of both feet. Ever so gently I probed the wounds, glancing at Tanya’s face for some reaction. She showed none. The probe pushed easily through soft, necrotic tissue, and I could even see the white gleam of bare bone. Still no reaction from Tanya.

    …her mother told me Tanya’s story…“A few minutes later I went into Tanya’s room and found her sitting on the floor of the playpen, fingerpainting red swirls on the white plastic sheet. I didn’t grasp the situation at first, but when I got closer I screamed. It was horrible. The tip of Tanya’s finger was mangled and bleeding, and it was her own blood she was using to make those designs on the sheets. I yelled, ‘Tanya, what happened!’ She grinned at me, and that’s when I saw the streaks of blood on her teeth. She had bitten off the tip of her finger and was playing in the blood.”

    …The toddler laughed at spankings and other physical threats, and indeed seemed immune to all punishment. To get her way she merely had to lift a finger to her teeth and pretend to bite, and her parents capitulated at once. The parents’ horror turned to despair as wounds mysteriously appeared on Tanya’s fingers, one after another…I asked about the foot injuries. “They began as soon as she learned to walk,” the mother replied. “She’d step on a nail or thumbtack and not bother to pull it out. Now I check her feet at the end of every day, and often I discover a new wound or open sore. If she twists an ankle, she doesn’t limp, and so it twists again and again. An orthopedic specialist told me she’s permanently damaged the joint. If we wrap her feet for protection, sometimes in a fit of anger she’ll tear off the bandages. Once she ripped open a plaster cast with her bare fingers.”

    …Tanya suffered from a rare genetic defect known informally as “congenital indifference to pain”…Nerves in her hands and feet transmitted messages—she felt a kind of tingling when she burned herself or bit a finger—but these carried no hint of unpleasantness…She rather enjoyed the tingling sensations, especially when they produced such dramatic reactions in others…Tanya, now 11, was living a pathetic existence in an institution. She had lost both legs to amputation: she had refused to wear proper shoes and that, coupled with her failure to limp or shift weight when standing (because she felt no discomfort), had eventually put intolerable pressure on her joints. Tanya had also lost most of her fingers. Her elbows were constantly dislocated. She suffered the effects of chronic sepsis from ulcers on her hands and amputation stumps. Her tongue was lacerated and badly scarred from her nervous habit of chewing it.

  10. One of the first known cases was described in Dearborn 1932, of a man with a remarkable career of injuries as a child, ranging from being hoisted by a pick-axe to a hatchet getting stuck in his head to shooting himself in the index finger, culminating in a multi-year career as the “Human Pincushion”.↩︎

  11. The Challenge of Pain, Melzack & Wall 1996, describes another case (as quoted in Grahek 2001):

    As a child, she had bitten off the tip of her tongue while chewing food, and has suffered third-degree burns after kneeling on a hot radiator to look out of the window…Miss C. had severe medical problems. She exhibited pathological changes in her knees, hip and spine, and underwent several orthopedic operations. Her surgeon attributed these changes to the lack of protection to joints usually given by pain sensation. She apparently failed to shift her weight when standing, to turn over in her sleep, or to avoid certain postures, which normally prevent the inflammation of joints. All of us quite frequently stumble, fall or wrench a muscle during ordinary activity. After these trivial injuries, we limp a little or we protect the joint so that it remains unstressed during the recovery process. This resting of the damaged area is an essential part of its recovery. But those who feel no pain go on using the joint, adding insult to injury.

  12. A recent US example is Minnesotan Gabby Gingras (b. 2001), featured in the 2005 documentary A Life Without Pain, and occasionally covered in the media since (eg “Medical Mystery: A World Without Pain: A rare genetic disorder leaves one little girl in constant danger”, “Minnesota girl who can’t feel pain battles insurance company”).

    She is legally blind, having damaged her eyes & defeated attempts to save her vision by stitching her eyes shut. She would chew on things, so her baby teeth were surgically removed to avoid her breaking them—but then she broke her adult teeth when they grew in; she can’t use dentures because her gums are so badly destroyed, which required special surgery to graft bone from her hips into her jaw to provide a foundation for teeth. And so on.↩︎

  13. HN user remote_phone:

    My cousin feels pain or discomfort but only a little. This almost affected her when she gave birth because her water had broken but she didn’t feel any contractions at all until it was almost too late. Luckily she got to the hospital in time and her son was born perfectly normal but it was a bit harrowing.

    More interestingly, her son inherited this. He doesn’t feel pain the same way normal people do. Once her son broke his wrist and had to go to the hospital. He wasn’t in pain, but I think they had to pull on the arm to put it back in place properly (is this called traction?). The doctor was putting in all his effort to separate the wrist from the arm, and the dad almost fainted because it looked so gruesome, but all the son looked like was mildly discomforted from the tension. The doctor was apparently shocked at how little pain he felt.

    The son also pulled out all his teeth on his own, as they got loose. He said it bothered him to have loose teeth, but the act of pulling them out didn’t bother him at all.

  14. See “The Hazards of Growing Up Painlessly” for a particularly recent example.↩︎

  15. A genetics paper has a profile of a pain-insensitive patient (which is particularly eyebrow-raising in light of earlier discussions of joint damage):

    The patient had been diagnosed with osteoarthritis of the hip, which she reported as painless, which was not consistent with the severe degree of joint degeneration. At 65 yr of age, she had undergone a hip replacement and was administered only paracetamol 2g orally on postoperative days 1 and 2, reporting that she was encouraged to take the paracetamol, but that she did not ask for any analgesics. She was also administered a single dose of morphine sulphate 10mg orally on the first postoperative evening that caused severe nausea and vomiting for 2 days. After operation, her pain intensity scores were 0⁄10 throughout except for one score of 1⁄10 on the first postoperative evening. Her past surgical history was notable for multiple varicose vein and dental procedures for which she has never required analgesia. She also reported a long history of painless injuries (e.g. suturing of a laceration and left wrist fracture) for which she did not use analgesics. She reported numerous burns and cuts without pain (Supplementary Fig. S1), often smelling her burning flesh before noticing any injury, and that these wounds healed quickly with little or no residual scar. She reported eating chili peppers without any discomfort, but a short-lasting “pleasant glow” in her mouth. She described sweating normally in warm conditions.

  16. Brand’s Pain: The Gift No One Wants (pg209–211) describes meeting an Indian woman whose pain was cured by a lobotomy (designed to sever as little of the prefrontal cortex as possible), who described it in almost exactly the same terms as Dennett’s paraphrase: “When I inquired about the pain, she said, ‘Oh, yes, it’s still there. I just don’t worry about it anymore.’ She smiled sweetly and chuckled to herself. ‘In fact, it’s still agonizing. But I don’t mind.’” (Dennett elsewhere draws a connection between ‘not minding’ and Zen Buddhism.) See also Barber 1959.↩︎

  17. Amnesiacs apparently may still be able to learn fear or pain associations with unpleasant stimuli despite their memory impairment and sometimes reduced pain sensitivity, which makes them a borderline case here: the aversiveness outlasts the (remembered) qualia.↩︎

  18. One optimal-control paper helpfully provides a Rosetta Stone between optimal control theory & reinforcement learning (see also Powell 2018 & Bertsekas 2019):

    The notation and terminology used in this paper is standard in DP and optimal control, and in an effort to forestall confusion of readers that are accustomed to either the reinforcement learning or the optimal control terminology, we provide a list of selected terms commonly used in reinforcement learning (for example in the popular book by Sutton and Barto [SuB98], and its 2018 on-line 2nd edition), and their optimal control counterparts.

    1. Agent = Controller or decision maker.
    2. Action = Control.
    3. Environment = System.
    4. Reward of a stage = (Opposite of) Cost of a stage.
    5. State value = (Opposite of) Cost of a state.
    6. Value (or state-value) function = (Opposite of) Cost function.
    7. Maximizing the value function = Minimizing the cost function.
    8. Action (or state-action) value = Q-factor of a state-control pair.
    9. Planning = Solving a DP problem with a known mathematical model.
    10. Learning = Solving a DP problem in model-free fashion.
    11. Self-learning (or self-play in the context of games) = Solving a DP problem using policy iteration.
    12. Deep reinforcement learning = Approximate DP using value and/or policy approximation with deep neural networks.
    13. Prediction = Policy evaluation.
    14. Generalized policy iteration = Optimistic policy iteration.
    15. State abstraction = Aggregation.
    16. Episodic task or episode = Finite-step system trajectory.
    17. Continuing task = Infinite-step system trajectory.
    18. Afterstate = Post-decision state.
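    The sign conventions in items 4–7 can be checked mechanically: running value iteration on a cost table and on its negation produces value functions that are exact negatives of each other. A minimal sketch, using a hypothetical 2-state, 2-action deterministic MDP invented for illustration:

```python
# Check the reward = -cost correspondence: the "agent" maximizing reward
# and the "controller" minimizing cost perform the same computation.
cost = {  # cost[state][action] = (stage cost, next state); toy MDP
    0: {0: (1.0, 0), 1: (4.0, 1)},
    1: {0: (2.0, 0), 1: (0.5, 1)},
}
gamma = 0.9  # discount factor

def value_iteration(stage, better, n_iter=500):
    """Generic DP sweep; `better` is min (for costs) or max (for rewards)."""
    V = {s: 0.0 for s in stage}
    for _ in range(n_iter):
        V = {s: better(c + gamma * V[s2] for c, s2 in stage[s].values())
             for s in stage}
    return V

# The table's "Opposite of": negate every stage cost to get a reward.
reward = {s: {a: (-c, s2) for a, (c, s2) in acts.items()}
          for s, acts in cost.items()}

V_cost = value_iteration(cost, min)      # optimal-control convention
V_reward = value_iteration(reward, max)  # reinforcement-learning convention
assert all(abs(V_cost[s] + V_reward[s]) < 1e-9 for s in cost)
```

    This is only the bookkeeping identity behind the list’s ‘opposite of’ entries; the rest of the dictionary (Q-factors, policy iteration, aggregation) carries over with the same sign flip.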

  19. There are some examples of “reward hacking” in past RL research which resemble such ‘self-injuring’ agents—for example, a bicycle agent is ‘rewarded’ for getting near a target (but not ‘punished’ for moving away), so it learns to steer in a loop around the target, passing near it repeatedly to earn the reward.↩︎
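    The bicycle case can be reproduced in a few lines: with an approach-only shaping reward, a policy that rides straight to the goal earns a bounded payout, while a policy that circles forever is paid again on every lap. A sketch (the geometry and reward scale are assumptions for illustration, not the original paper’s setup):

```python
import math

def shaped_reward(path, goal=(10.0, 0.0)):
    """Sum the approach-only shaping reward along a path of (x, y) points."""
    total = 0.0
    for p0, p1 in zip(path, path[1:]):
        d0, d1 = math.dist(p0, goal), math.dist(p1, goal)
        if d1 < d0:
            total += d0 - d1  # paid for getting closer...
        # ...but moving away is free: the missing punishment.
    return total

# Honest policy: ride straight from the origin to the goal.
straight = [(float(x), 0.0) for x in range(11)]

# Hacking policy: circle near the start; every lap's approach half pays again.
laps = 50
circle = [(math.cos(t * math.pi / 8), math.sin(t * math.pi / 8))
          for t in range(16 * laps)]

print(shaped_reward(straight))  # 10.0: paid once, bounded by the start distance
print(shaped_reward(circle))    # ~2 per lap, growing without bound
```

    The fix used in practice is potential-based shaping, which charges the agent symmetrically for retreating and so makes looping worthless.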

  20. From the Marsili article:

    In the mid-2000s, Wood’s lab at University College partnered with a Cambridge University scientist named Geoff Woods on a pioneering research project centered on a group of related families—all from a clan known as the Qureshi biradari—in rural northern Pakistan. Woods had learned about the families accidentally: On the hunt for potential test subjects for a study on the brain abnormality microcephaly, he heard about a young street performer, a boy who routinely injured himself (walking across burning coals, stabbing himself with knives) for the entertainment of crowds. The boy was rumored to feel no pain at all, a trait he was said to share with other family members…When Woods found the boy’s family, they told him that the boy had died from injuries sustained during a stunt leap from a rooftop.

  21. Drescher 2004 gives a similar account of motivational pain (pg77–78):

    But a merely mechanical state could not have the property of being intrinsically desirable or undesirable; inherently good or bad sensations, therefore, would be irreconcilable with the idea of a fully mechanical mind. Actually, though, it is your machinery’s very response to a state’s utility designation—the machinery’s very tendency to systematically pursue or avoid the state—that implements and constitutes a valued state’s seemingly inherent deservedness of being pursued or avoided. Roughly speaking, it’s not that you avoid pain (other things being equal) in part because pain is inherently bad; rather, your machinery’s systematic tendency to avoid pain (other things being equal) is what constitutes its being bad. That systematic tendency is what you’re really observing when you contemplate a pain and observe that it is “undesirable”, that it is something you want to avoid.

    The systematic tendency I refer to includes, crucially, the tendency to plan to achieve positively valued states (and then to carry out the plan), or to plan the avoidance of negatively valued states. In contrast, for example, sneezing is an insistent response to certain stimuli; yet despite the strength of the urge—sneezing can be very hard to suppress—we do not regard the sensation of sneezing as strongly pleasurable (nor the incipient-sneeze tingle, subsequently extinguished by the sneeze, as strongly unpleasant). The difference, I propose, is that nothing in our machinery inclines us to plan our way into situations that make us sneeze (and nothing strongly inclines us to plan the avoidance of an occasional incipient sneeze) for the sake of achieving the sneeze (or avoiding the incipient sneeze); the machinery just isn’t wired up to treat sneezes that way (nor should it be). The sensations we deem pleasurable or painful are those that incline us to plan our way to them or away from them, other things being equal.

  22. This is not about dopaminergic effects being rewarding themselves, but about the perception of current tasks vs alternative tasks. (After all, stimulants don’t simply make you enjoy staring at a wall while doing nothing.) If everything becomes more rewarding, then there is less to gain from switching, because alternatives will be estimated as little more rewarding; or, if reward sensitivity is boosted only for current activities, then there will be pressure against switching tasks, because it is unlikely that alternatives will be predicted to be more rewarding than the current task.↩︎
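    The argument can be made concrete with a softmax task-chooser. Under the (assumed) model that a stimulant pushes all estimated rewards toward a ceiling, the gap between the current task and its alternatives shrinks, and the probability of switching falls; the numbers, temperature, and boost function below are all hypothetical:

```python
import math

def p_switch(current, alternatives, temperature=0.1):
    """Softmax probability of choosing any task other than the current one."""
    tasks = [current] + list(alternatives)
    weights = [math.exp(r / temperature) for r in tasks]
    return 1 - weights[0] / sum(weights)

def boosted(r, b=0.5):
    """Hypothetical stimulant effect: push reward estimates toward a ceiling of 1."""
    return r + b * (1 - r)

current, alternatives = 0.5, [0.6, 0.4]
baseline = p_switch(current, alternatives)
drugged = p_switch(boosted(current), [boosted(r) for r in alternatives])
assert drugged < baseline  # a uniform boost compresses gaps -> less switching
```

    Boosting only the current task’s estimate (`p_switch(boosted(current), alternatives)`) suppresses switching even more sharply, matching the footnote’s second clause.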

  23. “On the Scientific Ways of Treating Natural Law”, Hegel 1803↩︎

  24. pg171–172; research on the pig involved paralyzing it & applying slight consistent pressure for 5–7h to spots, which was enough to trigger inflammation & kill hair on the spots.↩︎