Markets/evolution as backstops/ground truths for reinforcement learning/optimization: on some connections between Coase's theory of the firm/linear optimization/DRL/evolution/multicellular life/pain as multi-level optimization problems.
6 Dec 2018–1 Dec 2019; status: finished; certainty: possible; importance: 7
- Asymptotics Ascendant
- Man Proposes, God Disposes
- “Pain Is the Only School-Teacher”
- The Perpetual Peace
- See Also
- External Links
One defense of free markets notes the inability of non-market mechanisms to solve planning & optimization problems. This has difficulty with Coase’s paradox of the firm, and I note that the difficulty is increased by the fact that with improvements in computers, algorithms, and data, ever larger planning problems are solved. Expanding on some Cosma Shalizi comments, I suggest interpreting this phenomenon as a multi-level nested optimization paradigm: many systems can be usefully described as having two (or more) levels where a slow sample-inefficient but ground-truth ‘outer’ loss such as death, bankruptcy, or reproductive fitness, trains & constrains a fast sample-efficient but possibly misguided ‘inner’ loss which is used by learned mechanisms such as neural networks or linear programming (a group selection perspective). So, one reason for free-market or evolutionary or Bayesian methods in general is that while poorer at planning/optimization in the short run, they have the advantage of simplicity and operating on ground-truth values, and serve as a constraint on the more sophisticated non-market mechanisms. I illustrate by discussing corporations, multicellular life, reinforcement learning & meta-learning in AI, and pain in humans. This view suggests that there are inherent balances between market/non-market mechanisms which reflect the relative advantages between a slow unbiased method and faster but potentially arbitrarily biased methods.
In Coase’s theory of the firm, a paradox is noted: idealized competitive markets are optimal for allocating resources and making decisions to reach efficient outcomes, but each market is made up of participants such as large multinational mega-corporations which are not internally made of markets and make their decisions by non-market mechanisms, even things which could clearly be outsourced. In an oft-quoted and amusing passage, Herbert Simon dramatizes the actual situation:
Suppose that [“a mythical visitor from Mars”] approaches the Earth from space, equipped with a telescope that reveals social structures. The firms reveal themselves, say, as solid green areas with faint interior contours marking out divisions and departments. Market transactions show as red lines connecting firms, forming a network in the spaces between them. Within firms (and perhaps even between them) the approaching visitor also sees pale blue lines, the lines of authority connecting bosses with various levels of workers. As our visitor looked more carefully at the scene beneath, it might see one of the green masses divide, as a firm divested itself of one of its divisions. Or it might see one green object gobble up another. At this distance, the departing golden parachutes would probably not be visible. No matter whether our visitor approached the United States or the Soviet Union, urban China or the European Community, the greater part of the space below it would be within green areas, for almost all of the inhabitants would be employees, hence inside the firm boundaries. Organizations would be the dominant feature of the landscape. A message sent back home, describing the scene, would speak of “large green areas interconnected by red lines.” It would not likely speak of “a network of red lines connecting green spots.”…When our visitor came to know that the green masses were organizations and the red lines connecting them were market transactions, it might be surprised to hear the structure called a market economy. “Wouldn’t ‘organizational economy’ be the more appropriate term?” it might ask.
A free competitive market is a weighing machine, not a thinking machine; it weighs & compares proposed buys & sells made by participants, and reaches a clearing price. But where, then, do the things being weighed come from? Market participants are themselves not markets, and to appeal to the wisdom of the market is buck-passing; if markets ‘elicit information’ or ‘incentivize performance’, how is that information learned and expressed, and where do the actual actions which yield higher performance come from? At some point, someone has to do some real thinking. (A company can outsource its janitors to the free market, but then whatever contractor is hired still has to decide exactly when and where and how to do the janitor-ing; safe to say, it does not hold an internal auction among its janitors to divide up responsibilities and set their schedules.)
The paradox is that free markets appear to depend on entities which are internally run as totalitarian command dictatorships. One might wonder why there is such a thing as a firm, instead of everything being accomplished by exchanges among the most atomic unit (currently) possible, individual humans. Coase’s suggestion is that it is a principal-agent problem: there’s risk, negotiation costs, trade secrets, betrayal, and having a difference between the principal and agent at all can be too expensive & have too much overhead.
An alternative perspective comes from the socialist calculation debate: why have a market at all, with all its waste and competition, if a central planner can plan out optimal allocations and simply decree them? Cosma Shalizi, in a review1 of Spufford’s Red Plenty (which draws on Planning Problems in the USSR: The Contribution of Mathematical Economics to their Solution 1960–1971, ed Ellman 1973), discusses the history of linear optimization algorithms, which were also developed in Soviet Russia under Leonid Kantorovich and used for economic planning. One irony (which Shalizi ascribes to Stiglitz) is that under the same theoretical conditions in which markets could lead to an optimal outcome, so too could a linear optimization algorithm. In practice, of course, the Soviet economy couldn’t possibly be run that way because it would require optimizing over millions or billions of variables, requiring unfathomable amounts of computing power.
As it happens, we now have unfathomable amounts of computing power. What was once a modus tollens is now just a modus ponens.
Corporations, and tech companies in particular as the leading edge, routinely solve planning problems for logistics like fleets of cars or datacenter optimization involving millions of variables; the similar SAT solvers are ubiquitous in computer security research for modeling large computer codebases to verify safety or discover vulnerabilities; most robots couldn’t operate without constantly solving & optimizing enormous systems of equations. The internal planned ‘economies’ of tech companies have grown kudzu-like, sprouting ever larger datasets to predict and automated analyses to plan and market designs to control. The problems solved by retailers like Walmart or Target are world-sized.2 (‘“We are not setting the price. The market is setting the price”, he says. “We have algorithms to determine what that market is.”’) The motto of a Google or Amazon or Uber might be (to paraphrase Freeman Dyson’s paraphrase of John von Neumann in Infinite in All Directions, 1988): “All processes that are stable we shall plan. All processes that are unstable we shall compete in (for now).” Companies may use some limited internal ‘markets’ as useful metaphors for allocation, and dabble in prediction markets, but the internal dynamics of tech companies bear little resemblance to competitive free markets, and show little sign of moving in market-ward directions.
The march of planning also shows little sign of stopping. Uber is not going to stop using historical forecasts of demand to move around drivers to meet expected demand and optimize trip trajectories, in search of a virtuous cycle culminating in the optimal route, “the perpetual trip, the trip that never ends”; datacenters will not stop using linear solvers to allocate running jobs to machines in an optimal manner to minimize electricity consumption while balancing against latency and throughput; ‘markets’ like smartphone walled gardens rely ever more each year on algorithms parsing human reviews & binaries & clicks to decide how to rank or push advertising and conduct multi-armed bandit exploration of options; and so on endlessly.
So, can we run an economy with scaled-up planning approaching 100% centralization, while increasing efficiency and even outcompeting free capitalism-style competitive markets, as Cockshott & Cottrell propose (a proposal occasionally revived in pop socialism like The People’s Republic of Walmart: How the World’s Biggest Corporations are Laying the Foundation for Socialism)?
Let’s look at some more examples:
- corporations and growth
- humans, brains, and cells
- meta-learning in AI (particularly RL)
The striking thing about corporations improving is that they don’t; corporations don’t evolve (see the Price equation & multi-level selection, which can be applied to many things). The business world would look completely different if they did! Despite large differences in competency between corporations, the best corporations don’t simply ‘clone’ themselves and regularly take over arbitrary industries with their superior skills, only to eventually succumb to their mutant offspring who have become even more efficient.
We can copy the best software algorithms, like AlphaZero, indefinitely and they will perform as well as the original, and we can tweak them in various ways to make them steadily better (and this is in fact how many algorithms are developed, by constant iteration); species can reproduce themselves, steadily evolving to ever better exploit their niches, not to mention the power of selective breeding programs; individual humans can refine teaching methods and transmit competence (calculus used to be reserved for the most skilled mathematicians, and now is taught to ordinary high school students, and chess grandmasters have become steadily younger with better & more intensive teaching methods like chess engines); we could even clone exceptional individuals to get more similarly talented individuals, if we really wanted to. But we don’t see this happen with corporations. Instead, despite desperate struggles to maintain “corporate culture”, companies typically coast along, getting more and more sluggish, failing to spin off smaller companies as lean & mean as they used to be, until conditions change or random shocks or degradation finally do them in, such as perhaps some completely-unrelated company (sometimes founded by a complete outsider like a college student) eating their lunch.
Why do we not see exceptional corporations clone themselves and take over all market segments? Why don’t corporations evolve such that all corporations or businesses are now the hyper-efficient descendants of a single ur-corporation 50 years ago, all other corporations having gone extinct in bankruptcy or been acquired? Why is it so hard for corporations to keep their “culture” intact and retain their youthful lean efficiency, or if avoiding ‘aging’ is impossible, why not copy themselves or otherwise reproduce to create new corporations like themselves? Instead, successful large corporations coast on inertia or market failures like regulatory capture/monopoly, while successful small ones worry endlessly about how to preserve their ‘culture’ or how to ‘stay hungry’ or find a replacement for the founder as they grow, and there is constant turnover. The large corporations function just well enough that maintaining their existence is an achievement3.
Evolution & the Price equation require 3 things: entities which can replicate themselves; variation of entities; and selection on entities. Corporations have variation, they have selection—but they don’t have replication.
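For reference, the Price equation in its standard textbook form is given below; the notation is the conventional one and is not taken from this essay.

$$
\bar{w}\,\Delta\bar{z} \;=\; \operatorname{Cov}(w_i, z_i) \;+\; \operatorname{E}\!\left(w_i\,\Delta z_i\right)
$$

Here $z_i$ is the trait value of entity $i$, $w_i$ is its fitness (number of descendants), $\bar{w}$ is mean fitness, $\Delta\bar{z}$ is the change in the population's mean trait over one generation, and $\Delta z_i$ is the difference between entity $i$'s trait and the average trait of its descendants. The covariance term is selection acting on variation; the expectation term measures how faithfully traits are transmitted. One way to read the argument here is that corporations can have a nonzero covariance term, but without reliable replication the transmission term is large and noisy enough to swamp it.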
Corporations certainly undergo selection for kinds of fitness, and do vary a lot. The problem seems to be that corporations cannot replicate themselves. They can set up new corporations, yes, but that’s not necessarily replicating themselves—they cannot clone themselves the way a bacterium can. When a bacterium clones itself, it has… a clone, which is difficult to distinguish in any way from the ‘original’. In sexual organisms, children still resemble their parents to a great extent. But when a large corporation spins off a division or starts a new one, the result may be nothing like the parent and completely lack any secret sauce. A new acquisition will retain its original character and efficiencies (if any). A corporation satisfies the Peter Principle by eventually growing to its level of incompetence, which is always much smaller than ‘the entire economy’. Corporations are made of people, not interchangeable easily-copied widgets or strands of DNA. There is no ‘corporate DNA’ which can be copied to create a new one just like the old. The corporation may not even be able to ‘replicate’ itself over time, leading to scleroticism and aging—but this then leads to underperformance and eventually selection against it, one way or another. So, an average corporation appears little more efficient, particularly if we exclude any gains from new technologies, than an average corporation 50 years ago, and the challenges and failures of the rare multinational corporation 500 years ago like the Medici bank look strikingly similar to challenges and failures of banks today.
We can see a similar problem with other large-scale human organizations: ‘cultures’. An idea seen sometimes is that cultures undergo selection & evolution, and as such, are made up of adaptive beliefs/practices/institutions, which no individual understands (such as farming practices optimally tailored to local conditions); even apparently highly irrational & wasteful traditional practices may actually be an adaptive evolved response, which is optimal in some sense we as yet do not appreciate (sometimes linked to “Chesterton’s fence” as an argument for status quo-ism).
This is not a ridiculous position, since occasionally certain traditional practices have been vindicated by scientific investigation, but the lens of multilevel selection as defined by the Price equation shows there are serious quantitative issues with this: cultures or groups are rarely driven extinct, with most large-scale ones persisting for millennia; such ‘natural selection’ on the group-level is only tenuously linked to the many thousands of distinct practices & beliefs that make up these cultures; and these cultures mutate rapidly as fads and visions and stories and neighboring cultures and new technologies all change over time (compare the consistency of folk magic/medicine over even small geographic regions, or in the same place over several centuries). For most things, ‘traditional culture’ is simply flat-out wrong and harmful: its forms are mutually contradictory, are not verified by science, and contain no useful information, and—contrary to “Chesterton’s fence”—the older and harder it is to find a rational basis for a practice, the less likely it is to be helpful:
Chesterton’s meta-fence: “in our current system (democratic market economies with large governments) the common practice of taking down Chesterton fences is a process which seems well established and has a decent track record, and should not be unduly interfered with (unless you fully understand it)”.
The existence of many erroneous practices, and the successful diffusion of erroneous ones, is acknowledged by proponents of cultural evolution like Henrich (eg Henrich provides several examples which are comparable to genetic drift spreading harmful mutations), so the question here is one of emphasis or quantity: is the glass 1% full or 99% empty? It’s worth recalling the conditions for human expertise (Armstrong 2001, Principles of Forecasting; Tetlock 2005, Expert Political Judgment: How Good Is It? How Can We Know?; ed Ericsson 2006, The Cambridge Handbook of Expertise and Expert Performance; Kahneman & Klein 2009): repeated practice with quick feedback on objective outcomes in unchanging environments; these conditions are satisfied for relatively few human activities, which are more often rare, with long-delayed feedback, left to quite subjective appraisals mixed in with enormous amounts of randomness & consequences of many other choices before/after, and subject to potentially rapid change (and the more so the more people are able to learn). In such environments, people are more likely to fail to build expertise, be fooled by randomness, and construct elaborate yet erroneous theoretical edifices of superstition (like Tetlock’s hedgehogs). Evolution is no fairy dust which can overcome these serious inferential problems, which are why reinforcement learning is so hard.4
For something like farming, with regular feedback, results which are enormously important to both individual and group survival, and relatively straightforward mechanistic cause-and-effect relationships, it is not surprising that practices tend to be somewhat optimized (although still far from optimal, as enormously increased yields in the Industrial Revolution demonstrate, in part by avoiding the errors of traditional agriculture & using simple breeding techniques)5; but none of that applies to ‘traditional medicine’, dealing as it does with complex self-selection, regression to the mean, and placebo effects, where aside from the simplest cases like setting broken bones (again, straightforward, with a clear cause-and-effect relationship), hardly any of it works6 and one is lucky if a traditional remedy is merely ineffective rather than outright poisonous, and in the hardest cases like snake bites, it would be better to wait for death at home than waste time going to the local witch doctor.
So—just like corporations—‘selection’ of cultures happens rarely with each ‘generation’ spanning centuries or millennia, typically has little to do with how reality-based their beliefs tend to be (for a selection coefficient approaching zero), and if one culture did in fact consume another one thanks to more useful beliefs about some herb, it is likely to backslide under the bombardment of memetic mutation (so any selection is spent just purging mutations, creating a mutation-selection balance); under such conditions, there will be little long-term ‘evolution’ towards higher optima, and the information content of culture will be minimal and closely constrained to only the most universal, high-fitness-impact, and memetically-robust aspects.
“Individual organisms are best thought of as adaptation-executers rather than as fitness-maximizers. Natural selection cannot directly ‘see’ an individual organism in a specific situation and cause behavior to be adaptively tailored to the functional requirements imposed by that situation.”
Tooby & Cosmides 1992, “The Psychological Foundations of Culture”
Contrast that with a human. Despite ultimately being designed by evolution, evolution then plays no role at ‘runtime’ and more powerful learning algorithms take over.
With these more powerful algorithms designed by the meta-algorithm of evolution, a human is able to live successfully for over 100 years, with tremendous cooperation between the trillions of cells in their body, only rarely breaking down towards the end with a small handful of seed cancer cells defecting over a lifetime despite even more trillions of cell divisions and replacements. They are also able to be cloned, yielding identical twins so similar across the board that people who know them may be unable to distinguish them. And they don’t need to use evolution or markets to develop these bodies, instead relying on a complex hardwired developmental program controlled by genes which ensures that >99% of humans get the paired eyes, lungs, legs, brain hemispheres etc that they need. Perhaps the most striking efficiency gain from a human is the possession of a brain with the ability to predict the future, learn highly abstract models of the world, and plan and optimize over these plans for objectives which may only relate indirectly to fitness decades from now or fitness-related events which happen less than once in a lifetime & are usually unobserved or fitness events like that of descendants which can never be observed.
Let’s put it another way.
Imagine trying to run a business in which the only feedback given is whether you go bankrupt or not. In running that business, you make millions or billions of decisions, to adopt a particular model, rent a particular store, advertise this or that, hire one person out of scores of applicants, assign them this or that task to make many decisions of their own (which may in turn require decisions to be made by still others), and so on, extended over many years. At the end, you turn a healthy profit, or go bankrupt. So you get 1 bit of feedback, which must be split over billions of decisions. When a company goes bankrupt, what killed it? Hiring the wrong accountant? The CEO not investing enough in R&D? Random geopolitical events? New government regulations? Putting its HQ in the wrong city? Just a generalized inefficiency? How would you know which decisions were good and which were bad? How do you solve the “credit assignment problem”?
Ideally, you would have some way of tracing back every change in the financial health of a company back to the original decision & the algorithm which made that decision, but of course this is impossible since there is no way to know who said or did what or even who discussed what with whom when. There would seem to be no general approach other than the truly brute force one of evolution: over many companies, have some act one way and some act another way, and on average, good decisions will cluster in the survivors and not-so-good decisions will cluster in the deceased. ‘Learning’ here works (under certain conditions—like sufficiently reliable replication—which in practice may not obtain) but is horrifically expensive & slow.
In RL, this would correspond to black box/gradient-free methods, particularly evolutionary methods. For example, Salimans et al 2017 uses an evolutionary method in which thousands of slightly-randomized neural networks play an Atari game simultaneously, and at the end of the games, a new average neural network is defined based on the performance of them all; no attempt is made to figure out which specific changes are good or bad or even to get a reliable estimate—they simply run and the scores are what they are. If we imagine a schematic like ‘models → model parameters → environments → decisions → outcomes’, evolution collapses it to just ‘models → outcomes’; feed a bunch of possible models in, get back outcomes, pick the models with best outcomes.
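The ‘models → outcomes’ collapse can be sketched in a few lines. This is a toy illustration in the spirit of an evolution strategy, not Salimans et al’s implementation: `episode_score`, `TARGET`, and all hyperparameter values are hypothetical stand-ins, with a simple quadratic score replacing an Atari game.

```python
import random

TARGET = [0.5, -0.3, 0.8]  # hypothetical optimum, hidden from the optimizer


def episode_score(params):
    """Black-box stand-in for playing one game: only a final score comes back."""
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


def evolution_strategy(population=50, sigma=0.1, lr=0.05, generations=300, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(TARGET)
    for _ in range(generations):
        # Perturb the current parameters with Gaussian noise, one copy per "player".
        noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(population)]
        scores = [episode_score([t + sigma * e for t, e in zip(theta, eps)])
                  for eps in noises]
        baseline = sum(scores) / population
        # New parameters: move along the score-weighted average of the perturbations.
        # No attempt is made to work out *why* any one perturbation scored well.
        for j in range(len(theta)):
            step = sum((s - baseline) * eps[j] for s, eps in zip(scores, noises))
            theta[j] += lr * step / (population * sigma)
    return theta


if __name__ == "__main__":
    print(evolution_strategy())
```

The optimizer only ever calls `episode_score` and reads back a number, so nothing inside the ‘game’ is visible to it; every generation costs `population` full episodes for a single parameter update.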
A more sample-efficient method would be something like REINFORCE, which Andrej Karpathy explains with an ALE Pong agent; what does REINFORCE do to crack the black box open a little bit? It’s still horrific and amazing that it works:
So here is how the training will work in detail. We will initialize the policy network with some W1, W2 and play 100 games of Pong (we call these policy “rollouts”). Let’s assume that each game is made up of 200 frames so in total we’ve made 20,000 decisions for going UP or DOWN and for each one of these we know the parameter gradient, which tells us how we should change the parameters if we wanted to encourage that decision in that state in the future. All that remains now is to label every decision we’ve made as good or bad. For example suppose we won 12 games and lost 88. We’ll take all decisions we made in the winning games and do a positive update (filling in a +1.0 in the gradient for the sampled action, doing backprop, and parameter update encouraging the actions we picked in all those states). And we’ll take the other decisions we made in the losing games and do a negative update (discouraging whatever we did). And… that’s it. The network will now become slightly more likely to repeat actions that worked, and slightly less likely to repeat actions that didn’t work. Now we play another 100 games with our new, slightly improved policy and rinse and repeat.
Policy Gradients: Run a policy for a while. See what actions led to high rewards. Increase their probability.
If you think through this process you’ll start to find a few funny properties. For example what if we made a good action in frame 50 (bouncing the ball back correctly), but then missed the ball in frame 150? If every single action is now labeled as bad (because we lost), wouldn’t that discourage the correct bounce on frame 50? You’re right—it would. However, when you consider the process over thousands/millions of games, then doing the first bounce correctly makes you slightly more likely to win down the road, so on average you’ll see more positive than negative updates for the correct bounce and your policy will end up doing the right thing.
…I did not tune the hyperparameters too much and ran the experiment on my (slow) Macbook, but after training for 3 nights I ended up with a policy that is slightly better than the AI player. The total number of episodes was approximately 8,000 so the algorithm played roughly 200,000 Pong games (quite a lot isn’t it!) and made a total of ~800 updates.
The difference here from evolution is that the credit assignment is able to use backpropagation to reach into the NN and directly adjust each parameter’s contribution to the decision which was ‘good’ or ‘bad’; the difficulty of tracing out the consequences of each decision and labeling it ‘good’ is simply bypassed with the brute force approach of decreeing that all actions taken in an ultimately-successful game were good, and all of them were bad if the game was ultimately lost. Here we optimize something more like ‘model parameters → decisions → outcomes’; we feed parameters in to get out decisions which then are assumed to cause the outcome, and reverse it to pick the parameters with the best outcomes.
This is still crazy, but it works, and better than simple-minded evolution: Salimans et al 2017 compares their evolution method to more standard methods which are fancier versions of the REINFORCE policy gradient approach, and this brutally limited use of backpropagation for credit assignment still cuts the sample size by 3–10x, and more on more difficult problems.
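The “label every action by the final outcome” rule is short enough to write out. The following is a minimal sketch under made-up assumptions, not Karpathy’s Pong code: the game in `play_episode`, the one-parameter policy, and the hyperparameters are all hypothetical, chosen so the example runs in pure Python.

```python
import math
import random


def play_episode(p_good, rng, steps=10):
    """Hypothetical game: each step takes the 'good' action (1) with probability
    p_good; the chance of winning equals the fraction of good actions taken."""
    actions = [1 if rng.random() < p_good else 0 for _ in range(steps)]
    won = rng.random() < sum(actions) / steps
    return actions, (1.0 if won else -1.0)


def reinforce(episodes=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    w = 0.0  # single policy parameter: P(good action) = sigmoid(w)
    for _ in range(episodes):
        p = 1.0 / (1.0 + math.exp(-w))
        actions, outcome = play_episode(p, rng)
        # For a Bernoulli-logistic policy, d/dw log pi(a) = a - p.
        # Every action in the episode receives the same label: the final outcome.
        w += lr * outcome * sum(a - p for a in actions)
    return 1.0 / (1.0 + math.exp(-w))  # final probability of the good action


if __name__ == "__main__":
    print(reinforce())
```

Individual updates are often wrong (a lost game punishes its good actions too), but good actions appear slightly more often in won games, so the policy drifts in the right direction on average.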
Can we do better? Of course. It is absurd to claim that all actions in a game determine the final outcome, since the environment itself is stochastic and many decisions are either irrelevant or were the opposite in true quality of whatever the outcome was. To do better, we can connect the decisions to the environment by modeling the environment itself as a white box which can be cracked open & analyzed, using a model-based RL approach like the well-known PILCO.
In PILCO, a model of the environment is learned by a powerful model (the non-neural-network Gaussian process, in this case), and the model is used to do planning: start with a series of possible actions, run them through the model to predict what would happen, and directly optimize the actions to maximize the reward. The influence of the parameters on the chosen actions, which then partially determine the state of the environment, which then partially determines the reward, can all be traced from the final reward back to the original parameters. (It’s white boxes all the way down.) Here the full ‘models → model parameters → environments → decisions → outcomes’ pipeline is expressed and the credit assignment is performed correctly & as a whole.
The result is state-of-the-art sample efficiency: in a simple problem like Cartpole, PILCO can solve it within as little as 10 episodes, while standard deep reinforcement learning approaches like policy gradients can struggle to solve it within 10,000 episodes.
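The learn-a-model-then-plan-through-it loop can be illustrated with a much cruder setup than PILCO’s. In the sketch below, a least-squares linear fit stands in for the Gaussian process and a one-dimensional system stands in for Cartpole; `true_env`, its coefficients, and every hyperparameter are hypothetical.

```python
import random


def true_env(x, u):
    """Hypothetical environment, unknown to the agent: x' = 0.9*x + 0.5*u."""
    return 0.9 * x + 0.5 * u


def collect_rollout(steps=20, seed=0):
    """One short episode of random actions: the only real experience used."""
    rng = random.Random(seed)
    x, data = 1.0, []
    for _ in range(steps):
        u = rng.uniform(-1.0, 1.0)
        x_next = true_env(x, u)
        data.append((x, u, x_next))
        x = x_next
    return data


def fit_linear_model(data):
    """Least-squares fit of x' ~ a*x + b*u (a stand-in for PILCO's GP)."""
    sxx = sum(x * x for x, u, y in data)
    sxu = sum(x * u for x, u, y in data)
    suu = sum(u * u for x, u, y in data)
    sxy = sum(x * y for x, u, y in data)
    suy = sum(u * y for x, u, y in data)
    det = sxx * suu - sxu * sxu
    return (sxy * suu - suy * sxu) / det, (suy * sxx - sxy * sxu) / det


def plan(a, b, x0, goal, horizon=10, iters=2000, lr=0.04):
    """Optimize an action sequence by differentiating the cost through the model."""
    actions = [0.0] * horizon
    for _ in range(iters):
        xs = [x0]
        for u in actions:  # forward pass: imagined rollout inside the model
            xs.append(a * xs[-1] + b * u)
        grad_x = 0.0
        for t in reversed(range(horizon)):  # backward pass: credit assignment
            grad_x = 2.0 * (xs[t + 1] - goal) + a * grad_x  # d cost / d x_{t+1}
            actions[t] -= lr * b * grad_x                   # d cost / d u_t
    return actions


def execute(actions, x0):
    x = x0
    for u in actions:
        x = true_env(x, u)
    return x


if __name__ == "__main__":
    a, b = fit_linear_model(collect_rollout())
    print(execute(plan(a, b, x0=0.0, goal=1.0), 0.0))
```

All 2,000 planning iterations happen inside the fitted model, so the only real experience consumed is the single 20-step rollout; the cost is paid in computation rather than samples.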
The problem, of course, with model-based RL such as PILCO is that what they gain in correctness & sample-efficiency, they give back in computational requirements: I can’t compare PILCO’s sample-efficiency with Salimans et al 2017’s ALE sample-efficiency or even Karpathy’s Pong sample-efficiency because PILCO simply can’t be run on problems all that much more complex than Cartpole.
So we have a painful dilemma: sample-efficiency can be many orders of magnitude greater than possible with evolution, if only one could do more precise fine-grained credit assignment—instead of judging billions of decisions based solely on a single distant noisy binary outcome, the algorithm generating each decision can be traced through all of its ramifications through all subsequent decisions & outcomes to a final reward—but these better methods are not directly applicable. What to do?
“…the spacing that has made for the most successful inductions will have tended to predominate through natural selection. Creatures inveterately wrong in their inductions have a pathetic but praiseworthy tendency to die before reproducing their kind….In induction nothing succeeds like success.”
Speaking of evolutionary algorithms & sample-efficiency, an interesting area of AI and reinforcement learning is “meta-learning”, usually described as “learning to learn” (Botvinick et al 2019). This rewrites a given learning task as a two-level problem, where one seeks a meta-algorithm for a family of problems which then adapts at runtime to the specific problem at hand. (In evolutionary terms, this could be seen as related to a Baldwin effect.) There are many paradigms in meta-learning using various kinds of learning & optimizers; for a listing of several recent ones, see Table 1 of Metz et al 2018 (reproduced in an appendix).
For example, one could train an RNN on a ‘left or right’ T-maze task where the direction with the reward switches at random every once in a while: the RNN has a memory, its hidden state, so after trying the left arm a few times and observing no reward, it can encode “the reward has switched to the right”, and then decide to go right every time while continuing to encode how many failures it’s had after the switch; when the reward then switches back to the left, after a few failures on the right, the learned rule will fire and it’ll switch back to the left. Without this sequential learning, if it was just trained on a bunch of samples, where half the ‘lefts’ have a reward and half the ‘rights’ also have a reward (because of the constant switching), it’ll learn a bad strategy like picking a random choice 50-50, or always going left/right. Another approach is ‘fast weights’, where a starting meta-NN observes a few datapoints from a new problem, and then emits the adjusted parameters for a new NN, specialized to the problem, which is then run exactly and receives a reward, so the meta-NN can learn to emit adjusted parameters which will achieve high reward on all problems. A version of this might be the MAML meta-learning algorithms (Finn et al 2017) where a meta-NN is learned which is carefully balanced between possible NNs so that a few finetuning steps of gradient descent training within a new problem ‘specializes’ it to that problem (one might think of the meta-NN as being a point in the high-dimensional model space which is roughly equidistant from a large number of NNs trained on each individual problem, where tweaking a few parameters controls overall behavior and only those need to be learned from the initial experiences). In general, meta-learning enables learning of the superior Bayes-optimal agent within environments by inefficient (possibly not even Bayesian) training across environments (Ortega et al 2019). 
As Duff 2002 puts it, “One way of thinking about the computational procedures that I later propose is that they perform an offline computation of an online, adaptive machine. One may regard the process of approximating an optimal policy for the Markov decision process defined over hyper-states as ‘compiling’ an optimal learning strategy, which can then be ‘loaded’ into an agent.”
An interesting example of this approach is the DeepMind paper Jaderberg et al 2018, which presents a Quake team FPS agent trained using a two-level approach (and Leibo et al 2018 which extends it further with multiple populations; for background, see Sutton & Barto 2018; for an evolutionary manifesto, see Leibo et al 2019), an approach which was valuable for their AlphaStar StarCraft II agent publicized in January 2019. The FPS game is a multiplayer capture-the-flag match where teams compete on a map, rather than the agent controlling a single agent in a death-match setting; learning to coordinate, as well as explicitly communicate, with multiple copies of oneself is tricky and normal training methods don’t work well because updates change all the other copies of oneself as well and destabilize any communication protocols which have been learned. What Jaderberg et al do is use normal deep RL techniques within each agent, predicting and receiving rewards within each game based on earning points for flags/attacks, but then the overall population of 30 agents, after each set of matches, undergoes a second level of selection based on final game score/victory, which then selects on the agent’s internal reward prediction & hyperparameters:
This can be seen as a two-tier reinforcement learning problem. The inner optimisation maximises $J_{\text{inner}}$, the agents’ expected future discounted internal rewards. The outer optimisation of $J_{\text{outer}}$ can be viewed as a meta-game, in which the meta-reward of winning the match is maximised with respect to internal reward schemes $w_p$ and hyperparameters $\phi_p$, with the inner optimisation providing the meta transition dynamics. We solve the inner optimisation with RL as previously described, and the outer optimisation with Population Based Training (PBT) (29). PBT is an online evolutionary process which adapts internal rewards and hyperparameters and performs model selection by replacing under-performing agents with mutated versions of better agents. This joint optimisation of the agent policy using RL together with the optimisation of the RL procedure itself towards a high-level goal proves to be an effective and generally applicable strategy, and utilises the potential of combining learning and evolution (2) in large scale learning systems.
The goal is to win, the ground-truth reward is the win/loss, but learning only from win/loss is extremely slow: a single bit (probably less) of information must be split over all actions taken by all agents in the game and used to train NNs with millions of interdependent parameters, in a particularly inefficient way as one cannot compute exact gradients from the win/loss back to the responsible neurons. Within-game points are a much richer form of supervision, more numerous and corresponding to short time segments, allowing for much more learning within each game (possibly using exact gradients), but are only indirectly related to the final win/loss; an agent could rack up many points on its own while neglecting to fight the enemy or coordinate well, thereby ensuring a final defeat, or it could learn a greedy team strategy which performs well initially but loses over the long run. So the two-tier problem uses the slow ‘outer’ signal or loss function (winning) to sculpt the faster inner loss which does the bulk of the learning. (“Organisms are adaptation-executors, not fitness-maximizers.”) Should the fast inner algorithms not be learning something useful or go haywire or fall for a trap, the outer rewards will eventually recover from the mistake, by mutating or abandoning them in favor of more successful lineages. This combines the crude, slow, dogged optimization of evolution, with the much faster, more clever, but potentially misguided gradient-based optimization, to produce something which will reach the right goal faster. (Two more recent examples would be surrogate/synthetic gradients.)
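The division of labor between the two tiers can be illustrated with a deliberately tiny model. This is a sketch under invented assumptions, not the Jaderberg et al setup: each agent has a single internal reward parameter `w`, the inner loop uses exact gradients to maximize the agent's own reward `-(x - w)^2`, and the outer loop sees only noisy win counts and applies truncation selection plus mutation in the spirit of PBT. `TRUE_OPTIMUM`, the win-probability formula, and all hyperparameters are hypothetical.

```python
import random

TRUE_OPTIMUM = 0.7  # hypothetical: the behavior that actually wins matches

def inner_optimize(w, steps=20, lr=0.25):
    """Fast inner loop: exact-gradient ascent on the internal reward -(x - w)^2."""
    x = 0.5
    for _ in range(steps):
        x += lr * (-2.0 * (x - w))  # gradient of the internal reward w.r.t. x
    return x

def play_matches(x, rng, n=20):
    """Slow outer signal: a win count, where each match yields one noisy bit."""
    p_win = max(0.0, 1.0 - abs(x - TRUE_OPTIMUM))
    return sum(rng.random() < p_win for _ in range(n))

def evolve(pop_size=30, generations=40, sigma=0.05, seed=0):
    """Outer loop: keep the better half, refill with mutated copies of it."""
    rng = random.Random(seed)
    population = [rng.random() for _ in range(pop_size)]  # internal reward parameters
    for _ in range(generations):
        ranked = sorted(
            population,
            key=lambda w: play_matches(inner_optimize(w), rng),
            reverse=True,
        )
        survivors = ranked[: pop_size // 2]
        children = [min(1.0, max(0.0, w + rng.gauss(0.0, sigma))) for w in survivors]
        population = survivors + children
    return population

final_population = evolve()
mean_w = sum(final_population) / len(final_population)
```

The inner loop never sees a win or a loss, and the outer loop never computes a gradient. Selection on the slow signal alone is what drags the internal reward parameters toward values whose fast optimization produces winning behavior.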
Cosma Shalizi, elsewhere, enjoys noting formal identities between natural selection and Bayesian statistics (especially particle filtering) and markets, where the population frequency of an allele corresponds to a parameter’s prior probability or starting wealth of a trader, and fitness differentials/profits correspond to updates based on new evidence. (See also Evstigneev et al 2008/Lensberg & Schenk-Hoppé 2006, Campbell 2016, Czégel et al 2019.) While a parameter may start with an erroneously low prior, at some point the updates will make the posterior converge on it. (The relationship between populations of individuals with noisy fixed beliefs, and Thompson sampling, is also interesting: Krafft 2017. Can we see the apparently-inefficient stream of startups trying ‘failed’ ideas, and occasionally winding up winning big, as a kind of collective Thompson sampling & more efficient than it seems?) And stochastic gradient descent can be seen as secretly an approximation or variational form of Bayesian updates by estimating its gradients (because everything that works works because it’s Bayesian?) and of course evolutionary methods can be seen as calculating finite difference approximations to gradients…
|Framework|Entity|Starting weight|Update|
|---|---|---|---|
|Evolution|Allele|Population Frequency|Fitness Differential|
|Particle Filtering|Particles|Population Frequency|Accept-Reject Sample|
|SGD|Parameter|Random Initialization|Gradient Step|
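The formal identity between selection and Bayesian updating is easy to check numerically. The sketch below assumes the simplest discrete case (a finite set of alleles or hypotheses, with fixed fitnesses or likelihoods); the function names and the numbers are illustrative only.

```python
def replicator_step(frequencies, fitnesses):
    """One generation of discrete replicator dynamics: reweight by fitness, renormalize."""
    weighted = [p * f for p, f in zip(frequencies, fitnesses)]
    total = sum(weighted)
    return [w / total for w in weighted]

def bayes_update(priors, likelihoods):
    """Bayes' rule over a finite hypothesis set: posterior is prior times likelihood, renormalized."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

def iterate(update, weights, factors, n):
    """Apply the same update n times, as over n generations or n observations."""
    for _ in range(n):
        weights = update(weights, factors)
    return weights

# A rare but fitter allele (or an initially disfavored but better hypothesis).
start = [0.99, 0.01]
factors = [0.5, 0.6]
after_selection = iterate(replicator_step, start, factors, 50)
after_evidence = iterate(bayes_update, start, factors, 50)
```

The two functions perform the same arithmetic under different names, so the trajectories coincide. The variant that starts at 1% ends up dominating, which is the sense in which an erroneously low prior is eventually overcome by the updates.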
This pattern surfaces in our other examples too. This two-level learning is analogous to meta-learning: the outer or meta-algorithm learns how to generate an inner or object-level algorithm which can learn most effectively, better than the meta-algorithm. Inner algorithms themselves can learn better algorithms, and so on, gaining power, compute-efficiency, or sample-efficiency, with every level of specialization. (“It’s optimizers all the way up, young man!”) It’s also analogous to cells in a human body: overall reproductive fitness is a slow signal that occurs only a few times in a lifetime at most, but over many generations, it builds up fast-reacting developmental and homeostatic processes which can build an efficient and capable body and respond to environmental fluctuations within minutes rather than millennia, and the brain is still superior with split-second situations. It’s also analogous to corporations in a market: the corporation can use whatever internal algorithms it pleases, such as linear optimization or neural networks, and evaluate them internally using internal metrics like “number of daily users”; but eventually, this must result in profits…
The central problem a corporation solves is how to motivate, organize, punish & reward its sub-units and constituent humans in the absence of direct end-to-end losses without the use of slow external market mechanisms. This is done by tapping into social mechanisms like peer esteem (soldiers don’t fight for their country, they fight for their buddies), selecting workers who are intrinsically motivated to work usefully rather than parasitically, constant attempts to instill a “company culture” with sloganeering or handbooks or company songs, use of multiple proxy measures for rewards to reduce Goodhart-style reward hacking, ad hoc mechanisms like stock options to try to internalize within workers the market losses, replacing workers with outsourcing or automation, acquiring smaller companies which have not yet decayed internally or as a selection mechanism (“acquihires”), employing intellectual property or regulation… All of these techniques together can align the parts into something useful to eventually sell…
…Or else the company will eventually go bankrupt:
Great is Bankruptcy: the great bottomless gulf into which all Falsehoods, public and private, do sink, disappearing; whither, from the first origin of them, they were all doomed. For Nature is true and not a lie. No lie you can speak or act but it will come, after longer or shorter circulation, like a Bill drawn on Nature’s Reality, and be presented there for payment,—with the answer, No effects. Pity only that it often had so long a circulation: that the original forger were so seldom he who bore the final smart of it! Lies, and the burden of evil they bring, are passed on; shifted from back to back, and from rank to rank; and so land ultimately on the dumb lowest rank, who with spade and mattock, with sore heart and empty wallet, daily come in contact with reality, and can pass the cheat no further.
…But with a Fortunatus’ Purse in his pocket, through what length of time might not almost any Falsehood last! Your Society, your Household, practical or spiritual Arrangement, is untrue, unjust, offensive to the eye of God and man. Nevertheless its hearth is warm, its larder well replenished: the innumerable Swiss of Heaven, with a kind of Natural loyalty, gather round it; will prove, by pamphleteering, musketeering, that it is a truth; or if not an unmixed (unearthly, impossible) Truth, then better, a wholesomely attempered one, (as wind is to the shorn lamb), and works well. Changed outlook, however, when purse and larder grow empty! Was your Arrangement so true, so accordant to Nature’s ways, then how, in the name of wonder, has Nature, with her infinite bounty, come to leave it famishing there? To all men, to all women and all children, it is now indubitable that your Arrangement was false. Honour to Bankruptcy; ever righteous on the great scale, though in detail it is so cruel! Under all Falsehoods it works, unweariedly mining. No Falsehood, did it rise heaven-high and cover the world, but Bankruptcy, one day, will sweep it down, and make us free of it.7
A large corporation like Sears may take decades to die (“There is a great deal of ruin in a nation”, Adam Smith observed), but die it does. Corporations do not increase in performance rapidly and consistently the way selective breeding or AI algorithms do because they cannot replicate themselves as exactly as digital neural networks or biological cells can, but, nevertheless, they are still part of a two-tier process where a ground-truth uncheatable outer loss constrains the internal dynamics to some degree and maintains a baseline or perhaps modest improvement over time. The plan is “checked”, as Trotsky puts it in criticizing Stalin’s policies like abandoning the NEP, by supply and demand:
If a universal mind existed, of the kind that projected itself into the scientific fancy of Laplace—a mind that could register simultaneously all the processes of nature and society, that could measure the dynamics of their motion, that could forecast the results of their inter-reactions—such a mind, of course, could a priori draw up a faultless and exhaustive economic plan, beginning with the number of acres of wheat down to the last button for a vest. The bureaucracy often imagines that just such a mind is at its disposal; that is why it so easily frees itself from the control of the market and of Soviet democracy. But, in reality, the bureaucracy errs frightfully in its estimate of its spiritual resources.
…The innumerable living participants in the economy, state and private, collective and individual, must serve notice of their needs and of their relative strength not only through the statistical determinations of plan commissions but by the direct pressure of supply and demand. The plan is checked and, to a considerable degree, realized through the market.
Pain is a curious thing. Why do we have painful pain instead of just a more neutral painless pain, when it can backfire so easily as chronic pain, among other problems? Why do we have pain at all instead of regular learning processes or experiencing rewards as we follow plans?
Can we understand pain as another two-level learning process, where a slow but ground-truth outer loss constrains a fast but unreliable inner loss? I would suggest that pain itself is not an outer loss, but the painfulness of pain, its intrusive motivational aspects, is what makes it an outer loss. There is no logical necessity for pain to be painful, but a painless pain would not be adaptive or practical because it would too easily let the inner loss lead to damaging behavior.
So let’s consider the possibilities when it comes to pain. There isn’t just “pain”. There is (at the least):
- useless painful pain (chronic pain, exercise)
- useful painful pain (the normal sort)
- useful nonpainful nonpain (adrenaline rushes during combat)
- intermediate cases, like the Marsili family, who have a genetic mutation (Habib et al 2018) which partially damages pain perception. The Marsilis do feel useful painful pain but only briefly, and incur substantial bodily damage (broken bones, scars) but avoid the most horrific anecdotes of those with deadened nerves or pain asymbolia.
Another interesting case is the Scotswoman Jo Cameron, who has a different set of mutations to her endocannabinoid system (FAAH & FAAH-OUT): while not as bad as neuropathy, she still exhibits similar symptoms—her father who may also have been a carrier died peculiarly, she regularly burns or cuts herself in household chores, she broke her arm roller-skating as a child but didn’t seek treatment, delayed treatment of a damaged hip and then a hand damaged by arthritis until almost too late14, took in foster children who stole her savings, etc. (Biologist Matthew Hill describes the most common FAAH mutation as causing “low levels of anxiety, forgetfulness, a happy-go-lucky demeanor”, and “Since the paper was published, Matthew Hill has heard from half a dozen people with pain insensitivity, and he told me that many of them seemed nuts” compared to Jo Cameron.)
but—is there ‘useful painless pain’ or ‘useless painful nonpain’?
It turns out there is ‘painless pain’: lobotomized people experience it, and “reactive dissociation” is the phrase used to describe the effect which analgesics like morphine sometimes have when administered after pain has begun. To quote Dennett 1978 (emphasis in original): “After receiving the analgesic subjects commonly report not that the pain has disappeared or diminished (as with aspirin) but that the pain is as intense as ever though they no longer mind it…if it is administered before the onset of pain…the subjects claim to not feel any pain subsequently (though they are not numb or anesthetized—they have sensation in the relevant parts of their bodies); while if the morphine is administered after the pain has commenced, the subjects report that the pain continues (and continues to be pain), though they no longer mind it…Lobotomized subjects similarly report feeling intense pain but not minding it, and in other ways the manifestations of lobotomy and morphine are similar enough to lead some researchers to describe the action of morphine (and some barbiturates) as ‘reversible pharmacological leucotomy [lobotomy]’.23”15
And we can find examples of what appears to be ‘painful nonpain’: Grahek 2001 highlights a case-study, Ploner et al 1999, where the German patient’s somatosensory cortices suffered a lesion from a stroke, leading to an inability to feel heat normally on one side of his body or feel any spots of heat or pain from heat; despite this, when sufficient heat was applied to a single spot on the arm, the patient became increasingly agitated, describing a “clearly unpleasant” feeling associated with his whole arm, but denied any description of it involving crawling skin sensations or common words like “slight pain” or “burning”.
A table might help lay out the possibilities:
|Usefulness|Painfulness|Sensation|Examples|
|---|---|---|---|
|useless|painful|pain|chronic pain, exercise|
|useful|nonpainful|pain|reactive dissociation, lobotomies|
|useless|painful|nonpain|unconscious processes such as anesthesia awareness. Itches or tickles, anterograde amnesia?16|
|useful|painful|nonpain|cold/heat perception, as in the somatosensory cortex lesion case-study|
|useless|nonpainful|nonpain|deadened nerves from diseases (diabetes, leprosy), injury, drugs (anesthetics)|
Pain serves a clear purpose (stopping us from doing things which may cause damage to our bodies), but in an oddly unrelenting way which we cannot disable and which increasingly often backfires on our long-term interests in the form of ‘chronic pain’ and other problems. Why doesn’t pain operate more like a warning, or like hunger or thirst? They interrupt our minds, but like a computer popup dialogue, after due consideration of our plans and knowledge, we can generally dismiss them. Pain is the interruption which doesn’t go away, although (Morsella 2005):
Theoretically, nervous mechanisms could have evolved to solve the need for this particular kind of interaction otherwise. Apart from automata, which act like humans but have no phenomenal experience, a conscious nervous system that operates as humans do but does not suffer any internal strife. In such a system, knowledge guiding skeletomotor action would be isomorphic to, and never at odds with, the nature of the phenomenal state—running across the hot desert sand in order to reach water would actually feel good, because performing the action is deemed adaptive. Why our nervous system does not operate with such harmony is perhaps a question that only evolutionary biology can answer. Certainly one can imagine such integration occurring without anything like phenomenal states, but from the present standpoint, this reflects more one’s powers of imagination than what has occurred in the course of evolutionary history.
In the reinforcement learning context, one could ask: does it make a difference whether one has ‘negative’ or ‘positive’ rewards? Any reward function with both negative and positive rewards could be turned into all-positive rewards simply by adding a large constant. Is that a difference which makes a difference? Or instead of maximizing positive ‘rewards’, one could speak of minimizing ‘losses’, and one often does in economics or decision theory or control theory17.
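The claim that adding a constant changes nothing behaviorally can be checked by brute force in a small fixed-horizon problem. This is a minimal sketch with an invented reward function (`toy_reward`) and an invented action set; it assumes a fixed episode length and rewards that depend only on the time step and the action.

```python
import itertools

def best_sequence(reward, actions, horizon):
    """Brute-force the action sequence that maximizes summed reward over a fixed horizon."""
    return max(
        itertools.product(actions, repeat=horizon),
        key=lambda seq: sum(reward(t, a) for t, a in enumerate(seq)),
    )

def shifted(reward, c):
    """The same reward function with a constant c added to every reward."""
    return lambda t, a: reward(t, a) + c

def toy_reward(t, a):
    """Hypothetical all-non-positive reward: 0 for the 'right' action, negative otherwise."""
    return -abs(a - t % 3)

ACTIONS = [0, 1, 2]
HORIZON = 4
plan_negative = best_sequence(toy_reward, ACTIONS, HORIZON)
plan_positive = best_sequence(shifted(toy_reward, 100), ACTIONS, HORIZON)
```

Every sequence's total shifts by the same amount (horizon times `c`), so the ranking of sequences, and hence the chosen plan, is unchanged. The equivalence relies on the fixed horizon: if an agent could end an episode early or prolong it, an additive constant would reward or punish mere duration and could change behavior.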
“Do Artificial Reinforcement-Learning Agents Matter Morally?”, Tomasik 2014, debates the relationship of rewards to considerations of “suffering” or “pain”, given the duality between costs-losses/rewards:
Perhaps the more urgent form of refinement than algorithm selection is to replace punishment with rewards within a given algorithm. RL systems vary in whether they use positive, negative, or both types of rewards:
- In certain RL problems, such as maze-navigation tasks discussed in Sutton and Barto, the rewards are only positive (if the agent reaches a goal) or zero (for non-goal states).
- Sometimes a mix between positive and negative rewards6 is used. For instance, McCallum put a simulated mouse in a maze, with a reward of 1 for reaching the goal, −1 for hitting a wall, and −0.1 for any other action.
- In other situations, the rewards are always negative or zero. For instance, in the cart-pole balancing system of Barto et al., the agent receives reward of 0 until the pole falls over, at which point the reward is −1. In Koppejan and Whiteson’s neuroevolutionary RL approach to helicopter control, the RL agent is punished either a little bit, with the negative sum of squared deviations of the helicopter’s positions from its target positions, or a lot if the helicopter crashes.
Just as animal-welfare concerns may motivate incorporation of rewards rather than punishments in training dogs [Hiby et al., 2004] and horses [Warren-Smith and McGreevy, 2007, Innes and McBride, 2008], so too RL-agent welfare can motivate more positive forms of training for artificial learners. Pearce envisions a future in which agents are driven by ‘gradients of well-being’ (i.e., positive experiences that are more or less intense) rather than by the distinction between pleasure versus pain. However, it’s not entirely clear where the moral boundary lies between positive versus negative welfare for simple RL systems. We might think that just the sign of the agent’s reward value r would distinguish the cases, but the sign alone may not be enough, as the following section explains.
What’s the boundary between positive and negative welfare?
Consider an RL agent with a fixed life of $T$ time steps. At each time $t$, the agent receives a non-positive reward $r_t \le 0$ as a function of the action $a_t$ that it takes, such as in the pole-balancing example. The agent chooses its action sequence $(a_t)$ with the goal of maximising the sum of future rewards:

$$\sum_{t=1}^{T} r_t(a_t)$$

Now suppose we rewrite the rewards by adding a huge positive constant $c$ to each of them, $r'_t = r_t + c$, big enough that all of the $r'_t$ are positive. The agent now acts so as to optimise

$$\sum_{t=1}^{T} r'_t(a_t) = \sum_{t=1}^{T} r_t(a_t) + Tc$$
So the optimal action sequence is the same in either case, since additive constants don’t matter to the agent’s behaviour.7 But if behaviour is identical, the only thing that changed was the sign and numerical magnitude of the reward numbers. Yet it seems absurd that the difference between happiness and suffering would depend on whether the numbers used by the algorithm happened to have negative signs in front. After all, in computer binary, negative numbers have no minus sign but are just another sequence of 0s and 1s, and at the level of computer hardware, they look different still. Moreover, if the agent was previously reacting aversively to harmful stimuli, it would continue to do so. As Lenhart K. Schubert explains:8 [This quotation comes from spring 2014 lecture notes (accessed March 2014) for a course called “Machines and Consciousness”.]
If the shift in origin [to make negative rewards positive] causes no behavioural change, then the robot (analogously, a person) would still behave as if suffering, yelling for help, etc., when injured or otherwise in trouble, so it seems that the pain would not have been banished after all!
So then what distinguishes pleasure from pain?
…A more plausible account is that the difference relates to ‘avoiding’ versus ‘seeking.’ A negative experience is one that the agent tries to get out of and do less of in the future. For instance, injury should be an inherently negative experience, because if repairing injury was rewarding for an agent, the agent would seek to injure itself so as to do repairs more often. If we tried to reward avoidance of injury, the agent would seek dangerous situations so that it could enjoy returning to safety.10 [This example comes from Lenhart K. Schubert’s spring 2014 lecture notes (accessed March 2014), for a course called ‘Machines and Consciousness.’ These thought experiments are not purely academic. We can see an example of maladaptive behaviour resulting from an association of pleasure with injury when people become addicted to the endorphin release of self-harm.]18 Injury needs to be something the agent wants to get as far away from as possible. So, for example, even if vomiting due to food poisoning is the best response you can take given your current situation, the experience should be negative in order to dissuade you from eating spoiled foods again. Still, the distinction between avoiding and seeking isn’t always clear. We experience pleasure due to seeking and consuming food but also pain that motivates us to avoid hunger. Seeking one thing is often equivalent to avoiding another. Likewise with the pole-balancing agent: Is it seeking a balanced pole, or avoiding a pole that falls over?
…Where does all of this leave our pole-balancing agent? Does it suffer constantly, or is it enjoying its efforts? Likewise, is an RL agent that aims to accumulate positive rewards having fun, or is it suffering when its reward is suboptimal?
So with all that for background, what is the purpose of pain?
The purpose of pain, I would say, is as a ground truth or outer loss. (This is a motivational theory of pain with a more sophisticated RL/psychiatric grounding.)
The pain reward/loss cannot be removed entirely for the reasons demonstrated by the diabetics/lepers/congenital insensitives: the unnoticed injuries and the poor planning are ultimately fatal. Without any pain qualia to make pain feel painful, we will do harmful things like run on a broken leg or jump off a roof to impress our friends19, or just move in a not-quite-right fashion and a few years later wind up paraplegics. (An intrinsic curiosity drive alone would interact badly with a total absence of painful pain: after all, what is more novel or harder to predict than the strange and unique states which can be reached by self-injury or recklessness?)
If pain couldn’t be removed, could pain be turned into a reward, then? Could we be the equivalent of Morsella’s mind that doesn’t experience pain, as it infers plans and then executes them, experiencing only more or less rewards? It would experience only positive rewards (pleasure) as it runs across burning-hot sands, as this is the optimal action for it to be taking according to whatever grand plan it has thought of.
Perhaps we could… but what stops Morsella’s mind from enjoying rewards by literally running in circles on those sands until it dies or is crippled? Morsella’s mind may make a plan and define a reward function which avoids the need for any pain or negative rewards, but what happens if there is any flaw in the computed plan or the reward estimates? Or if the plan is based on mistaken premises? What if the sands are hotter than expected, or if the distance is much further than expected, or if the final goal (perhaps an oasis of water) is not there? Such a mind raises serious questions about learning and dealing with errors: what does such a mind experience when a plan fails? Does it experience nothing? Does it experience a kind of “meta-pain”?
Consider what Brand (The Gift of Pain again, pg191–197) describes as the ultimate cause of the failure of years of research into creating ‘pain prosthetics’, computerized gloves & socks that would measure heat & pressure in real-time in order to warn those without pain like lepers or diabetics: the patients would just ignore the warnings, because stopping to prevent future problems was inconvenient while continuing paid off now. And when electrical shockers were added to the system to stop them from doing a dangerous thing, Brand observed patients simply disabling it to do the dangerous thing & re-enabling it afterwards!
What pain provides is a constant, ongoing feedback which anchors all the estimates of future rewards based on planning or bootstrapping. It anchors our intelligence in a concrete estimation of bodily integrity: the intactness of skin, the health of skin cells, the lack of damage to muscles, joints sliding and moving as they ought to, and so on. If we are planning well and acting efficiently in the world, we will, in the long run, on average, experience higher levels of bodily integrity and physical health; if we are learning and choosing and planning poorly, then… we won’t. The badness will gradually catch up with us and we may find ourselves blind scarred paraplegics missing fingers and soon to die. A pain that was not painful would not serve this purpose, as it would merely be another kind of “tickling” sensation. (Some might find it interesting or enjoyable or it could accidentally become sexually-linked.) The perceptions in question are simply more ordinary tactile, kinesthetic, thermoreceptor, or other standard categories of perception; without painful pain, a fire burning your hand simply feels warm (before the thermal-perceptive nerves are destroyed and nothing further is felt), and a knife cutting flesh might feel like a rippling stretching rubbing movement.
We might say that a painful pain is a pain which forcibly inserts itself into the planning/optimization process, as a cost or lack of reward to be optimized. A pain which was not motivating is not what we mean by ‘pain’ at all.20 The motivation itself is the qualia of pain, much like an itch is an ordinary sensation coupled with a motivational urge to scratch. Any mental quality or emotion or sensation which is not accompanied by a demandingness, an involuntary taking-into-consideration, is not pain. The rest of our mind can force its way through pain, if it is sufficiently convinced that there is enough reason to incur the costs of pain because the long-term reward is so great, and we do this all the time: we can convince ourselves to go to the gym, or withstand the vaccination needle, or, in the utmost extremity, saw off a trapped hand to save our life. And if we are mistaken, and the predicted rewards do not arrive, eventually the noisy constant feedback of pain will override the decisions leading to pain, and whatever incorrect beliefs or models led to the incorrect decisions will be adjusted to do better in the future.
But the pain signal itself cannot and must not be disabled at will: human organisms can’t be trusted to simply ‘turn off’ pain and indulge an idle curiosity about cutting off hands. We are insufficiently intelligent, our priors insufficiently strong, our reasoning and planning too poor, and we must do too much learning within each life to do without pain.
A similar argument might apply to the puzzles of ‘willpower’ and ‘procrastination’. Why do we have such problems, particularly in a modern context, doing aught we know we should and doing naught we oughtn’t?
On the grave of the ‘blood glucose’ level theory, Kurzban et al 2013 (see later Shenhav et al 2017) erects an opportunity cost theory of willpower. Since objective physical measurements like blood glucose levels fail to mechanically explain poorer brain functionality, similar to the failure of objective physical measurements like lactate levels to explain why people are able to physically exercise only a certain amount (despite being able to exercise far more if properly motivated or if tricked), the reason for willpower running out must be subjective.
The lack of willpower is a heuristic which doesn’t require the brain to explicitly track & prioritize & schedule all possible tasks, by forcing it to regularly halt tasks—“like a timer that says, ‘Okay you’re done now.’” If one could override fatigue at will and do things like cycle for thousands of miles like ultra-endurance cyclist Jure Robič, the physical consequences would be severe, such as elaborate hallucinations (and incidentally, Robič was eventually run over while cycling). The ‘timer’ is implemented, among other things, as a gradual buildup of adenosine, which creates sleep homeostatic drive pressure and possibly physical fatigue during exercise (Noakes 2012, Martin et al 2018), leading to a gradually increasing subjectively perceived ‘cost’ of continuing with a task/staying awake/continuing athletic activities, which resets when one stops/sleeps/rests.
To explain the sugar-related observations, Kurzban et al 2013 suggest that the aversiveness of long focus and cognitive effort is a simple heuristic which creates a baseline cost to focusing for ‘too long’ on any one task, to the potential neglect of other opportunities, with the sugar interventions (such as merely tasting sugar water) which appear to boost willpower actually serving as proximate reward signals (signals, because the actual energetic content is nil, and cognitive effort doesn’t meaningfully burn calories in the first place), which justify to the underlying heuristic that further effort on the same task is worthwhile and the opportunity cost is minimal.
Since the human mind is too limited in its planning and monitoring ability, it cannot be allowed to ‘turn off’ opportunity cost warnings and engage in hyperfocus on potentially useless things at the neglect of all other things; procrastination here represents a psychic version of pain. From this perspective, it is not surprising that so many stimulants are adenosinergic or dopaminergic21, or that many anti-procrastination strategies boil down to optimizing for more rewards or more frequent rewards (eg breaking tasks down into many smaller tasks, which can be completed individually & receive smaller but more frequent rewards, or thinking more clearly about whether something is worth doing): all of these would affect the reward perception itself, and reduce the baseline opportunity cost ‘pain’. This perspective may also shed light on occupational burnout and why restorative hobbies are ideally maximally different from jobs and more miscellaneous observations like the lower rate of ‘hobbies’ outside the West: burnout may be a long-term homeostatic reaction to spending ‘too much’ time too frequently on a difficult not-immediately rewarding task despite earlier attempts to pursue other opportunities, which were always overridden, ultimately resulting in a total collapse; and hobbies ought to be as different as possible in location and physical activity and social structure (eg a solitary programmer indoors should pursue a social physical activity outdoors) to ensure that they feel completely different to the mind from the regular occupation; and in places with less job specialization or fewer work-hours, the regular flow of a variety of tasks and opportunities means that no such special activity as a ‘hobby’ is necessary.
Perhaps if we were superintelligent AIs who could trivially plan flawless humanoid locomotion at 1000Hz taking into account all possible damages, or if we were emulated brains sculpted by endless evolutionary procedures to execute perfectly adaptive plans by pure instinct, or if we were simple amoeba in a Petri dish who had no real choices to make, there would be no need for a pain which was painful. And likewise, were we endlessly planning and replanning to the end of days, we should never experience akrasia, we should merely do what is necessary (perhaps not even experiencing any qualia of effort or deliberation, merely seeing events endlessly unfold as they always had to). But we are not. The pain keeps us honest. In the end, pain is our only teacher.
“These laws, taken in the largest sense, being Growth with Reproduction; Inheritance which is almost implied by reproduction; Variability from the indirect and direct action of the external conditions of life, and from use and disuse; a Ratio of Increase so high as to lead to a Struggle for Life, and as a consequence to Natural Selection, entailing Divergence of Character and the Extinction of less-improved forms. Thus, from the war of nature, from famine and death, the most exalted object which we are capable of conceiving, namely, the production of the higher animals, directly follows. There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.”
Charles Darwin, On the Origin of Species
“In war, there is the free possibility that not only individual determinacies, but the sum total of these, will be destroyed as life, whether for the absolute itself or for the people. Thus, war preserves the ethical health of peoples in their indifference to determinate things [Bestimmtheiten]; it prevents the latter from hardening, and the people from becoming habituated to them, just as the movement of the winds preserves the seas from that stagnation which a permanent calm would produce, and which a permanent (or indeed ‘perpetual’) peace would produce among peoples.”
“We must recognize that war is common, strife is justice, and all things happen according to strife and necessity…War is father of all and king of all”
What if we remove the outer loss?
In a meta-learning context, it will then either overfit to a single instance of a problem, or learn a potentially arbitrarily suboptimal average response; in the Quake CTF, the inner loss might converge, as mentioned, to every-agent-for-itself or greedy tactical victories guaranteeing strategic losses; in a human, the result would (at present, due to refusal to use artificial selection or genetic engineering) be a gradual buildup of mutation load leading to serious health issues and eventually perhaps a mutational meltdown/error catastrophe; and in an economy, it leads to… the USSR.
The amount of this constraint can vary, depending on the power of the non-ground-truth optimization, the fidelity of replication, and the accuracy of selection. The Price equation gives us quantitative insight into the conditions under which group selection could work at all: if a NN could only copy itself in a crude and lossy way, meta-learning would not work well in the first place (properties must be preserved from one generation to the next); if a human cell copied itself with an error rate of as much as 1 in millions, humans could never exist because reproductive fitness is too weak a reward to purge the escalating mutation load (selective gain is negative); if bankruptcy becomes more arbitrary and has less to do with consumer demand than acts of god/government, then corporations will become more pathologically inefficient (covariance between traits & fitness too small to accumulate in meaningful ways).
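The conditions above can be checked numerically with a minimal sketch of the Price equation, Δz̄ = Cov(w, z)/w̄ + E[w·Δz]/w̄, where z is a parent's trait, w its number of offspring, and Δz the change in trait between parent and offspring. The function names and the four-member toy population are assumptions made up for illustration; they do not come from any of the cited work.

```python
def price_decomposition(traits, fitness, offspring_traits):
    """Return (selection, transmission, total) terms of the Price equation.

    traits[i]           -- trait value z_i of parent i
    fitness[i]          -- number of offspring w_i of parent i
    offspring_traits[i] -- mean trait among parent i's offspring
    """
    n = len(traits)
    w_bar = sum(fitness) / n
    z_bar = sum(traits) / n
    # Selection term: population covariance between fitness and trait.
    cov_wz = sum((w - w_bar) * (z - z_bar) for w, z in zip(fitness, traits)) / n
    # Transmission term: fitness-weighted copying error E[w * (z' - z)].
    exp_w_dz = sum(
        w * (zo - z) for w, z, zo in zip(fitness, traits, offspring_traits)
    ) / n
    selection = cov_wz / w_bar
    transmission = exp_w_dz / w_bar
    return selection, transmission, selection + transmission


def direct_change(traits, fitness, offspring_traits):
    """Change in mean trait, computed directly from the next generation."""
    z_bar = sum(traits) / len(traits)
    z_bar_next = sum(w * zo for w, zo in zip(fitness, offspring_traits)) / sum(fitness)
    return z_bar_next - z_bar


# Hypothetical population: higher trait values leave more offspring.
traits = [1.0, 2.0, 3.0, 4.0]
fitness = [1, 1, 2, 4]

faithful = list(traits)            # offspring copy the parent exactly
lossy = [z - 1.0 for z in traits]  # every copy degrades the trait by 1.0

print(price_decomposition(traits, fitness, faithful))
print(price_decomposition(traits, fitness, lossy))
```

With faithful copying the terms are (0.625, 0.0, 0.625): selection raises the mean trait. With the lossy copying they are (0.625, -1.0, -0.375): the same selection pressure is outweighed by replication error and the net change is negative, which is the ‘selective gain is negative’ case. Shrinking the covariance between `traits` and `fitness` (eg making `fitness` nearly constant) shrinks the selection term toward zero in the same way.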
As Shalizi concludes in his review:
Planning is certainly possible within limited domains—at least if we can get good data to the planners—and those limits will expand as computing power grows. But planning is only possible within those domains because making money gives firms (or firm-like entities) an objective function which is both unambiguous and blinkered. Planning for the whole economy would, under the most favorable possible assumptions, be intractable for the foreseeable future, and deciding on a plan runs into difficulties we have no idea how to solve. The sort of efficient planned economy dreamed of by the characters in Red Plenty is something we have no clue of how to bring about, even if we were willing to accept dictatorship to do so.
This is why the planning algorithms cannot simply keep growing and take over all markets: “who watches the watchmen?” As powerful as the various internal organizational and planning algorithms are, and much superior to evolution/market competition, they only optimize surrogate inner losses, which are not the end-goal, and they must be constrained by a ground-truth loss. The reliance on this loss can and should be reduced, but a reduction to zero is undesirable as long as the inner losses converge to any optima different from the ground-truth optima.
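The relationship between a surrogate inner loss and a ground-truth outer loss can be sketched as a toy two-level optimization. Everything here is an illustrative assumption: the quadratic losses, the function names, and the step sizes are invented, and the ‘evolution’ is reduced to a deterministic keep-the-best rule over three candidates.

```python
def ground_truth_loss(x):
    # Outer loss: in reality slow and expensive to observe (death, bankruptcy).
    # Here it is simply distance from the true optimum at x = 3.
    return (x - 3.0) ** 2


def inner_optimize(target, steps=100, lr=0.1):
    # Fast inner loop: gradient descent on the surrogate loss (x - target)^2.
    # It reliably finds the surrogate's optimum, wherever that happens to be.
    x = 0.0
    for _ in range(steps):
        x -= lr * 2.0 * (x - target)
    return x


def outer_select(target, generations=20, step=0.5):
    # Slow outer loop: mutate the surrogate's target and keep whichever
    # variant scores best on the ground-truth loss after inner optimization.
    for _ in range(generations):
        candidates = [target, target - step, target + step]
        target = min(candidates, key=lambda t: ground_truth_loss(inner_optimize(t)))
    return target


misguided = 0.0  # a surrogate that points at the wrong optimum
print(ground_truth_loss(inner_optimize(misguided)))  # inner loop alone

selected = outer_select(misguided)
print(selected, ground_truth_loss(inner_optimize(selected)))
```

The inner loop is far more efficient than the outer one (each outer generation pays for three full inner optimizations), but on its own it converges confidently to the surrogate's optimum and stays at a ground-truth loss of 9. Only the outer selection moves the surrogate target to 3.0, where the two optima coincide.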
Given the often long lifespan of a failing corporation, the difficulty corporations encounter in aligning employees with their goals, and the inability to reproduce their ‘culture’, it is no wonder that group selection in markets is feeble at best, and the outer loss cannot be removed. On the other hand, these failings are not necessarily permanent: as corporations gradually turn into software, which can be copied and exist in much more dynamic markets with faster OODA loops, perhaps we can expect a transition to an era where corporations do replicate precisely & can then start to consistently evolve large increases in efficiency, rapidly exceeding all progress to date.
Brand & Yancey’s 1993 Pain: The Gift No One Wants, pg191–197, recounts Brand’s research in the 1960s–1970s in attempting to create ‘artificial pain’ or ‘pain prosthetics’, which ultimately failed because human perception of pain is marvelously accurate & superior to the crude electronics of the day, but more fundamentally because they discovered the aversiveness of pain was critical to accomplishing the goal of discouraging repetitive or severely-damaging behavior, as the test subjects would simply ignore or disable the devices to get on with whatever they were doing.
My grant application bore the title “A Practical Substitute for Pain.” We proposed developing an artificial pain system to replace the defective system in people who suffered from leprosy, congenital painlessness, diabetic neuropathy, and other nerve disorders. Our proposal stressed the potential economic benefits: by investing a million dollars to find a way to alert such patients to the worst dangers, the government might save many millions in clinical treatment, amputations, and rehabilitation.
The proposal caused a stir at the National Institutes of Health in Washington. They had received applications from scientists who wanted to diminish or abolish pain, but never from one who wished to create pain. Nevertheless, we received funding for the project.
We planned, in effect, to duplicate the human nervous system on a very small scale. We would need a substitute “nerve sensor” to generate signals at the extremity, a “nerve axon” or wiring system to convey the warning message, and a response device to inform the brain of the danger. Excitement grew in the Carville research laboratory. We were attempting something that, to our knowledge, had never been tried.
I subcontracted with the electrical engineering department at Louisiana State University to develop a miniature sensor for measuring temperature and pressure. One of the engineers there joked about the potential for profit: “If our idea works, we’ll have a pain system that warns of danger but doesn’t hurt. In other words, we’ll have the good parts of pain without the bad! Healthy people will demand these gadgets for themselves in place of their own pain systems. Who wouldn’t prefer a warning signal through a hearing aid over real pain in a finger?”
The LSU engineers soon showed us prototype transducers, slim metal disks smaller than a shirt button. Sufficient pressure on these transducers would alter their electrical resistance, triggering an electrical current. They asked our research team to determine what thresholds of pressure should be programmed into the miniature sensors. I replayed my university days in Tommy Lewis’s pain laboratory, with one big difference: now, instead of merely testing the in-built properties of a well-designed human body, I had to think like the designer. What dangers would that body face? How could I quantify those dangers in a way the sensors could measure?
To simplify matters, we focused on fingertips and the soles of feet, the two areas that caused our patients the most problems. But how could we get a mechanical sensor to distinguish between the acceptable pressure of, say, gripping a fork and the unacceptable pressure of gripping a piece of broken glass? How could we calibrate the stress level of ordinary walking and yet allow for the occasional extra stress of stepping off a curb or jumping over a puddle? Our project, which we had begun with such enthusiasm, seemed more and more daunting.
I remembered from student days that nerve cells change their perception of pain in accordance with the body’s needs. We say a finger feels tender: thousands of nerve cells in the damaged tissue automatically lower their threshold of pain to discourage us from using the finger. An infected finger seems as if it is always getting bumped—it “sticks out like a sore thumb”—because inflammation has made it ten times more sensitive to pain. No mechanical transducer could be so responsive to the needs of living tissue.
Every month the optimism level of the researchers went down a notch. Our Carville team, who had made the significant findings about repetitive stress and constant stress, knew that the worst dangers came not from abnormal stresses, but from very normal stresses repeated thousands of times, as in the act of walking. And Sherman the pig had demonstrated that a constant pressure as low as one pound per square inch could cause skin damage. How could we possibly program all these variables into a miniature transducer? We would need a computer chip on every sensor just to keep track of changing vulnerability of tissues to damage from repetitive stress. We gained a new respect for the human body’s capacity to sort through such difficult options instantaneously.
After many compromises we settled on baseline pressures and temperatures to activate the sensors, and then designed a glove and a sock to incorporate several transducers. At last we could test our substitute pain system on actual patients. Now we ran into mechanical problems. The sensors, state-of-the-art electronic miniatures, tended to deteriorate from metal fatigue or corrosion after a few hundred uses. Short-circuits made them fire off false alarms, which aggravated our volunteer patients. Worse, the sensors cost about $2,060 each and a leprosy patient who took a long walk around the hospital grounds could wear out a $9,156 sock!
On average, a set of transducers held up to normal wear-and-tear for one or two weeks. We certainly could not afford to let a patient wear one of our expensive gloves for a task like raking leaves or pounding a hammer—the very activities we were trying to make safe. Before long the patients were worrying more about protecting our transducers, their supposed protectors, than about protecting themselves.
Even when the transducers worked correctly, the entire system was contingent on the free will of the patients. We had grandly talked of retaining “the good parts of pain without the bad,” which meant designing a warning system that would not hurt. First we tried a device like a hearing aid that would hum when the sensors were receiving normal pressures, buzz when they were in slight danger, and emit a piercing sound when they perceived an actual danger. But when a patient with a damaged hand turned a screwdriver too hard, and the loud warning signal went off, he would simply override it—This glove is always sending out false signals—and turn the screwdriver anyway. Blinking lights failed for the same reason.
Patients who perceived “pain” only in the abstract could not be persuaded to trust the artificial sensors. Or they became bored with the signals and ignored them. The sobering realization dawned on us that unless we built in a quality of compulsion, our substitute system would never work. Being alerted to the danger was not enough; our patients had to be forced to respond. Professor Tims of LSU said to me, almost in despair, “Paul, it’s no use. We’ll never be able to protect these limbs unless the signal really hurts. Surely there must be some way to hurt your patients enough to make them pay attention.”
We tried every alternative before resorting to pain, and finally concluded Tims was right: the stimulus had to be unpleasant, just as pain is unpleasant. One of Tims’s graduate students developed a small battery-operated coil that, when activated, sent out an electric shock at high voltage but low current. It was harmless but painful, at least when applied to parts of the body that could feel pain.
Leprosy bacilli, favoring the cooler parts of the body, usually left warm regions such as the armpit undisturbed, and so we began taping the electric coil to patients’ armpits for our tests. Some volunteers dropped out of the program, but a few brave ones stayed on. I noticed, though, that they viewed pain from our artificial sensors in a different way than pain from natural sources. They tended to see the electric shocks as punishment for breaking rules, not as messages from an endangered body part. They responded with resentment, not an instinct of self-preservation, because our artificial system had no innate link to their sense of self. How could it, when they felt a jolt in the armpit for something happening to the hand?
I learned a fundamental distinction: a person who never feels pain is task-oriented, whereas a person who has an intact pain system is self-oriented. The painless person may know by a signal that a certain action is harmful, but if he really wants to, he does it anyway. The pain-sensitive person, no matter how much he wants to do something, will stop for pain, because deep in his psyche he knows that preserving his own self is more significant than anything he might want to do.
Our project went through many stages, consuming five years of laboratory research, thousands of man-hours, and more than a million dollars of government funds. In the end we had to abandon the entire scheme. A warning system suitable for just one hand was exorbitantly expensive, subject to frequent mechanical breakdown, and hopelessly inadequate to interpret the profusion of sensations that constitute touch and pain. Most important, we found no way around the fundamental weakness in our system: it remained under the patient’s control. If the patient did not want to heed the warnings from our sensors, he could always find a way to bypass the whole system.
Looking back, I can point to a single instant when I knew for certain that the substitute pain project would not succeed. I was looking for a tool in the manual arts workshop when Charles, one of our volunteer patients, came in to replace a gasket on a motorcycle engine. He wheeled the bike across the concrete floor, kicked down the kickstand, and set to work on the gasoline engine. I watched him out of the corner of my eye. Charles was one of our most conscientious volunteers, and I was eager to see how the artificial pain sensors on his glove would perform.
One of the engine bolts had apparently rusted, and Charles made several attempts to loosen it with a wrench. It did not give. I saw him put some force behind the wrench, and then stop abruptly, jerking backward. The electric coil must have jolted him. (I could never avoid wincing when I saw our man-made pain system function as it was designed to do.) Charles studied the situation for a moment, then reached up under his armpit and disconnected a wire. He forced the bolt loose with a big wrench, put his hand in his shirt again, and reconnected the wire. It was then that I knew we had failed. Any system that allowed our patients freedom of choice was doomed.
I never fulfilled my dream of “a practical substitute for pain,” but the process did at last set to rest the two questions that had long haunted me. Why must pain be unpleasant? Why must pain persist? Our system failed for the precise reason that we could not effectively reproduce those two qualities of pain. The mysterious power of the human brain can force a person to STOP!—something I could never accomplish with my substitute system. And “natural” pain will persist as long as danger threatens, whether we want it to or not; unlike my substitute system, it cannot be switched off.
As I worked on the substitute system, I sometimes thought of my rheumatoid arthritis patients, who yearned for just the sort of on-off switch we were installing. If rheumatoid patients had a switch or a wire they could disconnect, most would destroy their hands in days or weeks. How fortunate, I thought, that for most of us the pain switch will always remain out of reach.
More Simon 1991:
Over a span of years, a large fraction of all economic activity has been gathered within the walls of large and steadily growing organizations. The green areas observed by our Martian have grown steadily. Ijiri and I have suggested that the growth of organizations may have only a little to do with efficiency (especially since, in most large-scale enterprises, economies and diseconomies of scale are quite small), but may be produced mainly by simple stochastic growth mechanisms (Ijiri and Simon, 1977).
But if particular coordination mechanisms do not determine exactly where the boundaries between organizations and markets will lie, the existence and effectiveness of large organizations does depend on some adequate set of powerful coordinating mechanisms being available. These means of coordination in organizations, taken in combination with the motivational mechanisms discussed earlier, create possibilities for enhancing productivity and efficiency through the division of labor and specialization.
In general, as specialization of tasks proceeds, the interdependency of the specialized parts increases. Hence a structure with effective mechanisms for coordination can carry specialization further than a structure lacking these mechanisms. It has sometimes been argued that specialization of work in modern industry proceeded quite independently of the rise of the factory system. This may have been true of the early phases of the industrial revolution, but would be hard to sustain in relation to contemporary factories. With the combination of authority relations, their motivational foundations, a repertory of coordinative mechanisms, and the division of labor, we arrive at the large hierarchical organizations that are so characteristic of modern life.
In RL terms, evolution, like Evolution Strategies, is a kind of Monte Carlo method. Monte Carlo methods require no knowledge or model of the environment, benefit from low bias, can handle even long-term consequences with ease, do not diverge, fail, or become biased like approaches using bootstrapping (especially in the case of the “deadly triad”), and are decentralized/embarrassingly parallel. A major downside, of course, is that they accomplish all this by being extremely high-variance/sample-inefficient (eg Salimans et al 2017 is ~10x worse than competing DRL methods).↩︎
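As a rough illustration of why such methods need no model and yet are sample-hungry, here is a bare-bones evolution-strategy sketch. It is a simplified variant written for this note, not the exact procedure of Salimans et al 2017; the objective, function names, and hyperparameters are all assumptions chosen so the example runs quickly.

```python
import random


def fitness(theta, target=(1.0, -2.0, 0.5)):
    # Black-box objective: the optimizer only ever sees this scalar score,
    # never a gradient or a model of how the score is produced.
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def evolution_strategy(theta, iterations=200, population=50,
                       sigma=0.1, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _ in range(iterations):
        # Sample random perturbations and score each perturbed candidate.
        noises = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(population)]
        returns = [fitness([t + sigma * e for t, e in zip(theta, eps)])
                   for eps in noises]
        baseline = sum(returns) / population  # subtracting the mean reduces variance
        # Monte Carlo estimate of the gradient of expected fitness,
        # followed by one ascent step per coordinate.
        for j in range(dim):
            grad_j = sum((r - baseline) * eps[j]
                         for r, eps in zip(returns, noises)) / (population * sigma)
            theta[j] += lr * grad_j
    return theta


start = [0.0, 0.0, 0.0]
found = evolution_strategy(start)
print(fitness(start), fitness(found))
```

Each of the 50 candidates per iteration could be scored on a separate machine, since they share nothing but the random seed and the final scalar; the price is 10,000 fitness evaluations to fit three parameters of a smooth quadratic.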
And note the irony of the widely-cited corn nixtamalization & anti-cyanide cassava examples of how farming encodes subtle wisdom due to group selection: in both cases, the groups that developed it in the Americas were, despite their superior local food processing, highly ‘unfit’ and suffered enormous population declines due to pandemic & conquest! You might object that those were exogenous factors, bad luck, due to things unrelated to their food processing… which is precisely the problem when selecting on groups.↩︎
An example of the failure of traditional medicine is provided by the NCI anti-cancer plant screening program, run by an enthusiast for medical folklore & ethnobotany who specifically targeted plants based on “a massive literature search, including ancient Chinese, Egyptian, Greek, and Roman texts”. The screening program screened “some 12,000 to 13,000 species…over 114,000 extracts were tested for antitumor activity” (rates rising steeply afterwards), which yielded 3 drugs ever (paclitaxel/Taxol/PTX, irinotecan, and rubitecan), only one of which was all that important (Taxol). So, in a period with few useful anti-cancer drugs to compete against, large-scale screening of all the low-hanging fruit, targeting plants prized by traditional medical practices from throughout history & across the globe, had a success rate somewhere on the order of 0.007%.
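The order of magnitude can be checked from the quoted counts. The sketch below only redoes the arithmetic under different choices of numerator and denominator; which combination the ‘0.007%’ figure is meant to reflect is not stated, so no particular one is assumed to be the intended calculation.

```python
def percent(hits, trials):
    """Hit rate expressed as a percentage."""
    return 100.0 * hits / trials


SPECIES_LOW, SPECIES_HIGH = 12_000, 13_000  # "some 12,000 to 13,000 species"
EXTRACTS = 114_000                          # "over 114,000 extracts"
DRUGS_TOTAL, DRUGS_IMPORTANT = 3, 1         # 3 drugs, of which 1 (Taxol) mattered

print(f"all drugs per species:      {percent(DRUGS_TOTAL, SPECIES_HIGH):.4f}% to "
      f"{percent(DRUGS_TOTAL, SPECIES_LOW):.4f}%")
print(f"all drugs per extract:      {percent(DRUGS_TOTAL, EXTRACTS):.4f}%")
print(f"important drug per species: {percent(DRUGS_IMPORTANT, SPECIES_HIGH):.4f}% to "
      f"{percent(DRUGS_IMPORTANT, SPECIES_LOW):.4f}%")
```

Depending on the denominator, the rates run from roughly 0.003% (3 drugs per 114,000 extracts) to 0.025% (3 drugs per 12,000 species), so every reading supports the same conclusion: well under 1 hit per 1,000 candidates.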
A recent example is the anti-malarial drug artemisinin, which earned its discoverer, Tu Youyou, a 2015 Nobel; she worked in a lab dedicated to traditional herbal medicine (Mao Zedong encouraged the construction of a ‘traditional Chinese medicine’ as a way to reduce medical expenses and conserve foreign currency). She discovered it in 1972, after screening several thousand traditional Chinese remedies. Artemisinin is important, and one might ask what else her lab discovered in the treasure trove of traditional Chinese medicine in the intervening 43 years; the answer, apparently, is ‘nothing’.
While Taxol and artemisinin may justify plant screening on a pure cost-benefit basis (such a hit rate does not appear much worse than other methods, although one should note that the profit-hungry pharmaceutical industry does not prioritize or invest much in ‘bioprospecting’), the more important lesson here is about the accuracy of ‘traditional medicine’. Traditional medicine affords an excellent test case for ‘the wisdom of tradition’: medicine has hard endpoints as it is literally a matter of life and death, is an issue during every individual’s life at the individual level (rather than occasionally at the group level), effects can be extremely large (bordering on ‘silver bullet’ level) and tens of thousands or hundreds of thousands of years have passed for accumulation & selection. Given all of these favorable factors, can the wisdom of tradition still overcome the serious statistical difficulties and cognitive biases leading to false beliefs? Well, the best success stories of traditional medicine have accuracy rates like… <1%. So much for the ‘wisdom of tradition’.↩︎
Brand also describes a leprosy patient whose nerves had been deadened by the disease:
As I watched, this man tucked his crutches under his arm and began to run on both feet with a very lopsided gait….He ended up near the head of the line, where he stood panting, leaning on his crutches, wearing a smile of triumph…By running on an already dislocated ankle, he had put far too much force on the end of his leg bone and the skin had broken under the stress…I knelt beside him and found that small stones and twigs had jammed through the end of the bone into the marrow cavity. I had no choice but to amputate the leg below the knee.
These two scenes have long haunted me.
An example quote from Brand & Yancey’s 1993 Pain: The Gift No One Wants about congenital pain insensitivity:
When I unwrapped the last bandage, I found grossly infected ulcers on the soles of both feet. Ever so gently I probed the wounds, glancing at Tanya’s face for some reaction. She showed none. The probe pushed easily through soft, necrotic tissue, and I could even see the white gleam of bare bone. Still no reaction from Tanya.
…her mother told me Tanya’s story…“A few minutes later I went into Tanya’s room and found her sitting on the floor of the playpen, fingerpainting red swirls on the white plastic sheet. I didn’t grasp the situation at first, but when I got closer I screamed. It was horrible. The tip of Tanya’s finger was mangled and bleeding, and it was her own blood she was using to make those designs on the sheets. I yelled, ‘Tanya, what happened!’ She grinned at me, and that’s when I saw the streaks of blood on her teeth. She had bitten off the tip of her finger and was playing in the blood.”
…The toddler laughed at spankings and other physical threats, and indeed seemed immune to all punishment. To get her way she merely had to lift a finger to her teeth and pretend to bite, and her parents capitulated at once. The parents’ horror turned to despair as wounds mysteriously appeared on one of Tanya’s fingers after another…I asked about the foot injuries. “They began as soon as she learned to walk,” the mother replied. “She’d step on a nail or thumbtack and not bother to pull it out. Now I check her feet at the end of every day, and often I discover a new wound or open sore. If she twists an ankle, she doesn’t limp, and so it twists again and again. An orthopedic specialist told me she’s permanently damaged the joint. If we wrap her feet for protection, sometimes in a fit of anger she’ll tear off the bandages. Once she ripped open plaster cast with her bare fingers.”
…Tanya suffered from a rare genetic defect known informally as “congenital indifference to pain”…Nerves in her hands and feet transmitted messages—she felt a kind of tingling when she burned herself or bit a finger—but these carried no hint of unpleasantness…She rather enjoyed the tingling sensations, especially when they produced such dramatic reactions in others…Tanya, now 11, was living a pathetic existence in an institution. She had lost both legs to amputation: she had refused to wear proper shoes and that, coupled with her failure to limp or shift weight when standing (because she felt no discomfort), had eventually put intolerable pressure on her joints. Tanya had also lost most of her fingers. Her elbows were constantly dislocated. She suffered the effects of chronic sepsis from ulcers on her hands and amputation stumps. Her tongue was lacerated and badly scarred from her nervous habit of chewing it.
One of the first known cases was described in Dearborn 1932, of a man with a remarkable career of injuries as a child ranging from being hoisted by a pick-axe to a hatchet getting stuck in his head to shooting himself in the index finger, culminating in a multi-year career as the “Human Pincushion”.↩︎
As a child, she had bitten off the tip of her tongue while chewing food, and has suffered third-degree burns after kneeling on a hot radiator to look out of the window…Miss C. had severe medical problems. She exhibited pathological changes in her knees, hip and spine, and underwent several orthopedic operations. Her surgeon attributed these changes to the lack of protection to joints usually given by pain sensation. She apparently failed to shift her weight when standing, to turn over in her sleep, or to avoid certain postures, which normally prevent the inflammation of joints. All of us quite frequently stumble, fall or wrench a muscle during ordinary activity. After these trivial injuries, we limp a little or we protect the joint so that it remains unstressed during the recovery process. This resting of the damaged area is an essential part of its recovery. But those who feel no pain go on using the joint, adding insult to injury.
A recent US example is Minnesotan Gabby Gingras (b. 2001), featured in the 2005 documentary A Life Without Pain, and occasionally covered in the media since (eg “Medical Mystery: A World Without Pain: A rare genetic disorder leaves one little girl in constant danger”, “Minnesota girl who can’t feel pain battles insurance company”). She is legally blind (having damaged her eyes & defeated attempts to save her vision like stitching her eyes shut); her baby teeth were removed to avoid her breaking them, but then she broke her adult teeth; and she can’t use dentures because her gums are so badly destroyed, which requires special surgery to graft bone from her hips into her jaw to provide a foundation for teeth.↩︎
A genetics paper, Habib et al 2019, has a profile of a pain-insensitive patient (which is particularly eyebrow-raising in light of earlier discussions of joint damage):
The patient had been diagnosed with osteoarthritis of the hip, which she reported as painless, which was not consistent with the severe degree of joint degeneration. At 65 yr of age, she had undergone a hip replacement and was administered only paracetamol 2g orally on Postoperative days 1 and 2, reporting that she was encouraged to take the paracetamol, but that she did not ask for any analgesics. She was also administered a single dose of morphine sulphate 10mg orally on the first postoperative evening that caused severe nausea and vomiting for 2 days. After operation, her pain intensity scores were throughout except for one score of on the first postoperative evening. Her past surgical history was notable for multiple varicose vein and dental procedures for which she has never required analgesia. She also reported a long history of painless injuries (e.g. suturing of a laceration and left wrist fracture) for which she did not use analgesics. She reported numerous burns and cuts without pain (Supplementary Fig. S1), often smelling her burning flesh before noticing any injury, and that these wounds healed quickly with little or no residual scar. She reported eating Scotch bonnet chili peppers without any discomfort, but a short-lasting “pleasant glow” in her mouth. She described sweating normally in warm conditions.
Brand’s Pain: The Gift No One Wants (pg209–211) describes meeting an Indian woman whose pain was cured by a lobotomy (designed to sever as little of the prefrontal cortex as possible), who described it in almost exactly the same terms as Dennett’s paraphrase: “When I inquired about the pain, she said, ‘Oh, yes, it’s still there. I just don’t worry about it anymore.’ She smiled sweetly and chuckled to herself. ‘In fact, it’s still agonizing. But I don’t mind.’” (Dennett elsewhere draws a connection between ‘not minding’ and Zen Buddhism.) See also Barber 1959.↩︎
Amnesiacs apparently may still be able to learn fear or pain associations with unpleasant stimuli despite their memory impairment and sometimes reduced pain sensitivity, which makes them a borderline case here: the aversiveness outlasts the (remembered) qualia.↩︎
The notation and terminology used in this paper is standard in DP and optimal control, and in an effort to forestall confusion of readers that are accustomed to either the reinforcement learning or the optimal control terminology, we provide a list of selected terms commonly used in reinforcement learning (for example in the popular book by Sutton and Barto [SuB98], and its 2018 on-line 2nd edition), and their optimal control counterparts.
- Agent = Controller or decision maker.
- Action = Control.
- Environment = System.
- Reward of a stage = (Opposite of) Cost of a stage.
- State value = (Opposite of) Cost of a state.
- Value (or state-value) function = (Opposite of) Cost function.
- Maximizing the value function = Minimizing the cost function.
- Action (or state-action) value = Q-factor of a state-control pair.
- Planning = Solving a DP problem with a known mathematical model.
- Learning = Solving a DP problem in model-free fashion.
- Self-learning (or self-play in the context of games) = Solving a DP problem using policy iteration.
- Deep reinforcement learning = Approximate DP using value and/or policy approximation with deep neural networks.
- Prediction = Policy evaluation.
- Generalized policy iteration = Optimistic policy iteration.
- State abstraction = Aggregation.
- Episodic task or episode = Finite-step system trajectory.
- Continuing task = Infinite-step system trajectory.
- Afterstate = Post-decision state.
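The reward/cost entries in the list above can be checked numerically. The following is a minimal sketch, not taken from the quoted paper: it runs discounted value iteration once as reward maximization and once as cost minimization on a made-up 2-state, 2-action MDP (the transition table `P`, the `reward` table, and the function name `value_iteration` are all illustrative assumptions), and confirms that the value function is the negative of the cost function and that both forms select the same policy.

```python
def value_iteration(P, g, gamma=0.9, iters=200, maximize=True):
    """Discounted value iteration.

    P[a][s][t] is the probability of moving from state s to t under action a;
    g[a][s] is the one-stage reward (maximize=True) or cost (maximize=False).
    """
    n_actions, n_states = len(P), len(P[0])
    pick = max if maximize else min
    V = [0.0] * n_states
    policy = [0] * n_states
    for _ in range(iters):
        new_V = []
        for s in range(n_states):
            # Action values (RL) / Q-factors (optimal control) at state s.
            q = [
                g[a][s] + gamma * sum(P[a][s][t] * V[t] for t in range(n_states))
                for a in range(n_actions)
            ]
            best = pick(q)
            new_V.append(best)
            policy[s] = q.index(best)
        V = new_V
    return V, policy


# Hypothetical toy problem, chosen only for illustration.
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.5, 0.5], [0.6, 0.4]],  # action 1
]
reward = [
    [1.0, 0.0],  # action 0
    [0.0, 2.0],  # action 1
]
cost = [[-r for r in row] for row in reward]  # cost = (opposite of) reward

V_reward, pi_reward = value_iteration(P, reward, maximize=True)
J_cost, pi_cost = value_iteration(P, cost, maximize=False)
```

Here `V_reward[s]` and `-J_cost[s]` agree for every state, which is all that the “(Opposite of)” entries assert; the choice between the two conventions is purely notational.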
There are some examples of “reward hacking” in past RL research which resemble such ‘self-injuring’ agents: for example, a bicycle agent is ‘rewarded’ for getting near a target (but not ‘punished’ for moving away), so it learns to steer toward the target in a loop, circling it repeatedly to keep earning the reward.↩︎
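A rough sketch of why such a one-sided reward invites looping, under simplifying assumptions that are not from the original research: the agent is a point in the plane rather than a bicycle, the reward at each step is the decrease in distance to the target (clipped at zero so that retreating is free), and the helper names `shaped_return`, `straight_path`, and `looping_path` are hypothetical.

```python
import math


def shaped_return(path, target):
    """Total one-sided shaping reward: approaching pays, retreating costs nothing."""
    total = 0.0
    for prev, cur in zip(path, path[1:]):
        progress = math.dist(prev, target) - math.dist(cur, target)
        total += max(0.0, progress)
    return total


def straight_path(start, target, steps):
    """Evenly spaced points on the segment from start to target."""
    return [
        (
            start[0] + (target[0] - start[0]) * i / steps,
            start[1] + (target[1] - start[1]) * i / steps,
        )
        for i in range(steps + 1)
    ]


def looping_path(center, radius, loops, steps_per_loop):
    """Points on a circle traversed `loops` times, starting at angle 0."""
    n = loops * steps_per_loop
    return [
        (
            center[0] + radius * math.cos(2 * math.pi * i / steps_per_loop),
            center[1] + radius * math.sin(2 * math.pi * i / steps_per_loop),
        )
        for i in range(n + 1)
    ]


target = (0.0, 0.0)
# Both paths start at (7, 0). The circle never reaches the target, but its
# distance to the target swings between 7 and 1 on every lap.
direct_return = shaped_return(straight_path((7.0, 0.0), target, steps=20), target)
loop_return = shaped_return(
    looping_path(center=(4.0, 0.0), radius=3.0, loops=5, steps_per_loop=40), target
)
```

The direct path can earn at most its starting distance, while each lap of the loop re-earns the gap between its farthest and nearest points, so `loop_return` grows without bound as laps are added. Penalizing retreat symmetrically (dropping the `max(0.0, ...)` clip) would make every closed loop net to zero.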
From the Marsili article:
In the mid-2000s, Wood’s lab at University College partnered with a Cambridge University scientist named Geoff Woods on a pioneering research project centered on a group of related families—all from a clan known as the Qureshi biradari—in rural northern Pakistan. Woods had learned about the families accidentally: On the hunt for potential test subjects for a study on the brain abnormality microcephaly, he heard about a young street performer, a boy who routinely injured himself (walking across burning coals, stabbing himself with knives) for the entertainment of crowds. The boy was rumored to feel no pain at all, a trait he was said to share with other family members…When Woods found the boy’s family, they told him that the boy had died from injuries sustained during a stunt leap from a rooftop.
Drescher 2004 gives a similar account of motivational pain in Good and Real (pg77–78):
But a merely mechanical state could not have the property of being intrinsically desirable or undesirable; inherently good or bad sensations, therefore, would be irreconcilable with the idea of a fully mechanical mind. Actually, though, it is your machinery’s very response to a state’s utility designation—the machinery’s very tendency to systematically pursue or avoid the state—that implements and constitutes a valued state’s seemingly inherent deservedness of being pursued or avoided. Roughly speaking, it’s not that you avoid pain (other things being equal) in part because pain is inherently bad; rather, your machinery’s systematic tendency to avoid pain (other things being equal) is what constitutes its being bad. That systematic tendency is what you’re really observing when you contemplate a pain and observe that it is “undesirable”, that it is something you want to avoid.
The systematic tendency I refer to includes, crucially, the tendency to plan to achieve positively valued states (and then to carry out the plan), or to plan the avoidance of negatively valued states. In contrast, for example, sneezing is an insistent response to certain stimuli; yet despite the strength of the urge—sneezing can be very hard to suppress—we do not regard the sensation of sneezing as strongly pleasurable (nor the incipient-sneeze tingle, subsequently extinguished by the sneeze, as strongly unpleasant). The difference, I propose, is that nothing in our machinery inclines us to plan our way into situations that make us sneeze (and nothing strongly inclines us to plan the avoidance of an occasional incipient sneeze) for the sake of achieving the sneeze (or avoiding the incipient sneeze); the machinery just isn’t wired up to treat sneezes that way (nor should it be). The sensations we deem pleasurable or painful are those that incline us to plan our way to them or away from them, other things being equal.
This is not about dopaminergic effects being rewarding themselves, but about the perception of current tasks vs alternative tasks. (After all, stimulants don’t simply make you enjoy staring at a wall while doing nothing.) If everything becomes more rewarding, then there is less to gain from switching, because alternatives will be estimated as little more rewarding; or, if reward sensitivity is boosted only for current activities, then there will be pressure against switching tasks, because it is unlikely that alternatives will be predicted to be more rewarding than the current task.↩︎
pg171–172; research on the pig involved paralyzing it & applying slight consistent pressure for 5–7h to spots, which was enough to trigger inflammation & kill hair on the spots.↩︎