Evolution as Backstop for Reinforcement Learning

Markets/evolution as backstops/ground truths for reinforcement learning/optimization: on some connections between Coase's theory of the firm/linear optimization/DRL/evolution/multicellular life/pain as multi-level optimization problems.
topics: Bayes, biology, NN, philosophy
created: 6 Dec 2018; modified: 04 Feb 2019; status: finished; confidence: possible; importance: 0

One defense of free markets notes the inability of non-market mechanisms to solve planning & optimization problems. This has difficulty with Coase’s paradox of the firm, and I note that the difficulty is increased by the fact that with improvements in computers, algorithms, and data, ever larger planning problems are solved. Expanding on some Cosma Shalizi comments, I suggest a multi-level optimization paradigm: many systems can be seen as having two (or more) levels where a slow sample-inefficient but ground-truth outer loss such as death, bankruptcy, or reproductive fitness, trains & constrains a fast sample-efficient but possibly misguided inner loss which is used by learned mechanisms such as neural networks or linear programming (group selection perspective). So, one reason for free market or evolutionary or Bayesian methods in general is that while poorer at planning/optimization in the short run, they have the advantage of simplicity and operating on ground-truth values, and serve as a constraint on the more sophisticated non-market mechanisms. I illustrate by discussing corporations, multicellular life, reinforcement learning & meta-learning in AI, and pain in humans. This view suggests that there are inherent balances between market/non-market mechanisms which reflect the relative advantages between a slow unbiased method and faster but potentially arbitrarily biased methods.

In Coase’s theory of the firm, a paradox is noted: idealized competitive markets are optimal for allocating resources and making decisions to reach efficient outcomes, but each market is made up of participants such as large multinational mega-corporations which are not internally made of markets and make their decisions by non-market mechanisms, even things which could clearly be outsourced. In an oft-quoted and amusing passage, Herbert Simon dramatizes the actual situation:

Suppose that [a mythical visitor from Mars] approaches the Earth from space, equipped with a telescope that reveals social structures. The firms reveal themselves, say, as solid green areas with faint interior contours marking out divisions and departments. Market transactions show as red lines connecting firms, forming a network in the spaces between them. Within firms (and perhaps even between them) the approaching visitor also sees pale blue lines, the lines of authority connecting bosses with various levels of workers. As our visitor looked more carefully at the scene beneath, it might see one of the green masses divide, as a firm divested itself of one of its divisions. Or it might see one green object gobble up another. At this distance, the departing golden parachutes would probably not be visible. No matter whether our visitor approached the United States or the Soviet Union, urban China or the European Community, the greater part of the space below it would be within green areas, for almost all of the inhabitants would be employees, hence inside the firm boundaries. Organizations would be the dominant feature of the landscape. A message sent back home, describing the scene, would speak of large green areas interconnected by red lines. It would not likely speak of a network of red lines connecting green spots.…When our visitor came to know that the green masses were organizations and the red lines connecting them were market transactions, it might be surprised to hear the structure called a market economy. Wouldn’t organizational economy be the more appropriate term? it might ask.

A free competitive market is a weighing machine, not a thinking machine; it weighs & compares proposed buys & sells made by participants, and reaches a clearing price. But where, then, do the things being weighed come from? Market participants are themselves not markets, and to appeal to the wisdom of the market is buck-passing; if markets elicit information or incentivize performance, how is that information learned and expressed, and where do the actual actions which yield higher performance come from? At some point, someone has to do some real thinking. (A company can outsource its janitors to the free market, but then whatever contractor is hired still has to decide exactly when and where and how to do the janitor-ing; safe to say, it does not hold an internal auction among its janitors to divide up responsibilities and set their schedules.)

The paradox is that free markets appear to depend on entities which are internally run as totalitarian command dictatorships. One might wonder why there is such a thing as a firm, instead of everything being accomplished by exchanges among the most atomic unit (currently) possible, individual humans. Coase’s suggestion is that it is a principal-agent problem: there’s risk, negotiation costs, trade secrets, betrayal, and having a difference between the principal and agent at all can be too expensive & have too much overhead.

Asymptotics Ascendant

An alternative perspective comes from the socialist calculation debate: why have a market at all, with all its waste and competition, if a central planner can plan out optimal allocations and simply decree it? Cosma Shalizi in a review1 of Spufford’s Red Plenty (which draws on Planning Problems in the USSR: The Contribution of Mathematical Economics to their Solution 1960-1971, ed Ellman 1973), discusses the history of linear optimization algorithms, which were also developed in Soviet Russia under Leonid Kantorovich and used for economic planning. One irony (which Shalizi ascribes to Stiglitz) is that under the same theoretical conditions in which markets could lead to an optimal outcome, so too could a linear optimization algorithm. In practice, of course, the Soviet economy couldn’t possibly be run that way because it would require optimizing over millions or billions of variables, requiring unfathomable amounts of computing power.
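To make concrete what a linear optimization planner computes, here is a minimal sketch on a toy problem. Everything in it is a hypothetical illustration rather than anything from Kantorovich or Shalizi: the two goods, the three capacity constraints, and the function name solve_lp are invented, and the method (checking every vertex of the feasible region, which only works when that region is bounded) is a stand-in for the simplex or interior-point solvers a real planner would use.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy plan: choose output levels x, y of two goods to maximize
# total value 3x + 5y. Each constraint (a, b, c) means a*x + b*y <= c.
OBJECTIVE = (3, 5)
CONSTRAINTS = [
    (1, 0, 4),    # plant 1 capacity
    (0, 2, 12),   # plant 2 capacity
    (3, 2, 18),   # plant 3 capacity
    (-1, 0, 0),   # x >= 0
    (0, -1, 0),   # y >= 0
]


def solve_lp(objective, constraints):
    """Return (best_value, (x, y)) by checking every feasible vertex."""
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundaries do not meet in a single point
        # Cramer's rule for the point where both constraints hold with equality.
        x = Fraction(c1 * b2 - c2 * b1, det)
        y = Fraction(a1 * c2 - a2 * c1, det)
        if all(a * x + b * y <= c for a, b, c in constraints):
            value = objective[0] * x + objective[1] * y
            if best is None or value > best[0]:
                best = (value, (x, y))
    return best


value, (x, y) = solve_lp(OBJECTIVE, CONSTRAINTS)
print(value, x, y)
```

With 2 variables and 5 constraints there are only 10 candidate intersections to check; the number of candidates explodes combinatorially as variables are added, which is one way to see why the size of the problem mattered so much.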

Optimization Obtained

As it happens, we now have unfathomable amounts of computing power. What was once a modus tollens is now just a modus ponens.

Corporations, and tech companies in particular as the leading edge, routinely solve planning problems for logistics like fleets of cars or datacenter optimization involving millions of variables; the similar SAT solvers are ubiquitous in computer security research for modeling large computer codebases to verify safety or discover vulnerabilities; most robots couldn’t operate without constantly solving & optimizing enormous systems of equations. The internal planned economies of tech companies have grown kudzu-like, sprouting ever larger datasets to predict and automated analyses to plan and market designs to control. The problems solved by retailers like Walmart or Target are world-sized.2 (We are not setting the price. The market is setting the price, he says. We have algorithms to determine what that market is.) The motto of a Google or Amazon or Uber might be (to paraphrase Freeman Dyson’s paraphrase of John von Neumann in Infinite in All Directions, 1988): All processes that are stable we shall plan. All processes that are unstable we shall capitalize (for now). Companies may use some limited internal markets as useful metaphors for allocation, and dabble in prediction markets, but the internal dynamics of tech companies bear little resemblance to competitive free markets, and show little sign of moving in market-ward directions.

The march of planning also shows little sign of stopping. Uber is not going to stop using historical forecasts of demand to move around drivers to meet expected demand and optimize trip trajectories; datacenters will not stop using linear solvers to allocate running jobs to machines in an optimal manner to minimize electricity consumption while balancing against latency and throughput, in search of a virtuous cycle culminating in the optimal route, the perpetual trip, the trip that never ends; markets like smartphone walled gardens rely ever more each year on algorithms parsing human reviews & binaries & clicks to decide how to rank or push advertising and conduct multi-armed bandit exploration of options; and so on endlessly.

So, can we run an economy with scaled-up planning, increasing efficiency and outcompeting competitive markets, as Cockshott & Cottrell propose?


Let’s look at some more examples:

  1. corporations and growth
  2. humans, brains, and cells
  3. meta-learning in AI (particularly RL)

Artificial Persons

The striking thing about corporations improving is that they don’t; corporations don’t evolve. We can copy the best algorithms, like AlphaZero, indefinitely and they will perform as well as the original, and we can tweak them in various ways to make them steadily better (and this is in fact how many algorithms are developed, by constant iteration); species can reproduce themselves, steadily evolving to ever better exploit their niches, not to mention the power of selective breeding programs; individual humans can refine teaching methods and transmit competence (calculus used to be reserved for the most skilled mathematicians, and now is taught to ordinary high school students, and chess grandmasters have become steadily younger with better & more intensive teaching methods like chess engines); we could even clone exceptional individuals to get more similarly talented individuals, if we really wanted to. Why do we not see exceptional corporations clone themselves and take over all market segments? Why don’t corporations evolve such that all corporations or businesses are now the hyper-efficient descendants of a single ur-corporation 50 years ago, all other corporations having gone extinct in bankruptcy or been acquired? Why is it so hard for corporations to keep their culture intact and retain their youthful lean efficiency, or if avoiding aging is impossible, why not copy themselves or otherwise reproduce to create new corporations like themselves? Instead, successful large corporations coast on inertia or market failures like regulatory capture/monopoly, while successful small ones worry endlessly about how to preserve their culture or how to stay hungry or find a replacement for the founder as they grow, and there is constant turnover. The large corporations function just well enough that maintaining their existence is an achievement3.

Evolution requires 3 things: entities which can replicate themselves; variation of entities; and selection on entities. Corporations certainly undergo selection for kinds of fitness, and do vary a lot. The problem seems to be that corporations cannot replicate themselves. They can set up new corporations, yes, but that’s not necessarily replicating themselves - they cannot clone themselves the way a bacterium can. When a bacterium clones itself, it has… a clone, which is difficult to distinguish in any way from the original. In sexual organisms, children still resemble their parents to a great extent. But when a large corporation spins off a division or starts a new one, the result may be nothing like the parent and completely lack any secret sauce. A new acquisition will retain its original character and efficiencies (if any). A corporation satisfies the Peter Principle by eventually growing to its level of incompetence, which is always much smaller than the entire economy. Corporations are made of people, not interchangeable easily-copied widgets or strands of DNA. There is no corporate DNA which can be copied to create a new one just like the old. The corporation may not even be able to replicate itself over time, leading to scleroticism and aging - but this then leads to underperformance and eventually selection against it, one way or another. So, an average corporation appears little more efficient, particularly if we exclude any gains from new technologies, than an average corporation 50 years ago, and the challenges and failures of the rare multinational corporation 500 years ago like the Medici bank look strikingly similar to challenges and failures of banks today.
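The role of replication fidelity can be illustrated with a small simulation. This is only a sketch under invented assumptions: fitness is a single number, selection keeps the top half each generation, and the fidelity parameter (a hypothetical name) interpolates between an offspring that inherits its parent's trait and one whose trait is drawn afresh, as a crude stand-in for a spin-off that shares nothing with its parent.

```python
import random


def evolve(fidelity, generations=50, pop_size=100, seed=0):
    """Mean trait value after repeated truncation selection.

    fidelity=1.0: offspring inherit the parent's trait (plus small mutation).
    fidelity=0.0: offspring traits are drawn afresh, unrelated to the parent.
    """
    rng = random.Random(seed)
    population = [rng.gauss(0, 1) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: only the fitter half founds the next generation.
        survivors = sorted(population, reverse=True)[: pop_size // 2]
        population = []
        for parent in survivors:
            for _ in range(2):  # replication: two offspring per survivor
                inherited = fidelity * parent + (1 - fidelity) * rng.gauss(0, 1)
                population.append(inherited + rng.gauss(0, 0.1))  # variation
    return sum(population) / len(population)


faithful = evolve(fidelity=1.0)    # bacterium-like copying
unfaithful = evolve(fidelity=0.0)  # offspring unrelated to the parent
print(f"faithful: {faithful:.2f}, unfaithful: {unfaithful:.2f}")
```

Selection and variation are identical in both runs; only inheritance differs. With faithful copying the gains of each generation accumulate, while without it selection has nothing to act on and the population mean stays near where it started.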

Natural Persons

Contrast that with a human. A human is able to grow over 100 years, with tremendous cooperation between the trillions of cells in their body, only rarely breaking down towards the end with a small handful of seed cancer cells defecting over a lifetime despite even more trillions of cell divisions and replacements. They are also able to be cloned, yielding identical twins so similar across the board that people who know them may be unable to distinguish them. And they don’t need to use evolution or markets to develop these bodies, instead relying on a complex hardwired developmental program controlled by genes which ensures that >99% of humans get the pairs of eyes, lungs, legs, brain hemispheres etc that they need. Perhaps the most striking efficiency gain from a human is the possession of a brain with the ability to predict the future, learn highly abstract models of the world, and plan and optimize over these plans for objectives which may only relate indirectly to fitness decades from now or fitness-related events which happen less than once in a lifetime & usually are unobserved or fitness events like that of descendants which can never be observed. Despite ultimately being designed by evolution, evolution then plays no role at runtime and more powerful learning algorithms take over.

Going Meta

Speaking of algorithms, an interesting area of AI and reinforcement learning is meta-learning, usually described as learning to learn. This rewrites a given learning task as a two-level problem, where one seeks a meta-algorithm for a family of problems which then adapts at runtime to the specific problem at hand. (In evolutionary terms, this could be seen as related to a Baldwin effect.) There are many paradigms in meta-learning using various kinds of learning & optimizers; for a listing of several recent ones, see Table 1 of Metz et al 2018 (reproduced in an appendix).

For example, one could train an RNN on a left or right T-maze task where the direction with the reward switches at random every once in a while: the RNN has a memory, its hidden state, so after trying the left arm a few times and observing no reward, it can encode the reward has switched to the right, and then decide to go right every time while continuing to encode how many failures it’s had after the switch; when the reward then switches back to the left, after a few failures on the right, the learned rule will fire and it’ll switch back to the left. Without this sequential learning, if it was just trained on a bunch of samples, where half the lefts have a reward and half the rights also have a reward (because of the constant switching), it’ll learn a bad strategy like picking a random choice 50-50, or always going left/right. Another approach is fast weights, where a starting meta-NN observes a few datapoints from a new problem, and then emits the adjusted parameters for a new NN, specialized to the problem, which is then run exactly and receives a reward, so the meta-NN can learn to emit adjusted parameters which will achieve high reward on all problems. A version of this might be the MAML meta-learning algorithms (Finn et al 2017) where a meta-NN is learned which is carefully balanced between possible NNs so that a few finetuning steps of gradient descent training within a new problem specializes it to that problem (one might think of the meta-NN as being a point in the high-dimensional model space which is roughly equidistant from a large number of NNs trained on each individual problem, where tweaking a few parameters controls overall behavior and only those need to be learned from the initial experiences).
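The T-maze example can be sketched without any neural network at all, to show why memory matters. The code below is an illustration under stated assumptions, not the setup of any cited paper: the reward flips on a fixed schedule rather than at random (for reproducibility), and FailureCounter is a hand-written stand-in for the kind of rule an RNN's hidden state might come to encode; all class and function names are hypothetical.

```python
def run_tmaze(policy, trials=200, block=20):
    """Total reward when the rewarded arm flips every `block` trials."""
    state = policy.initial_state()
    total = 0
    for t in range(trials):
        rewarded_arm = "left" if (t // block) % 2 == 0 else "right"
        choice = policy.act(state)
        reward = 1 if choice == rewarded_arm else 0
        state = policy.update(state, choice, reward)
        total += reward
    return total


class AlwaysLeft:
    """Memoryless policy: the kind of strategy shuffled samples might teach."""

    def initial_state(self):
        return None

    def act(self, state):
        return "left"

    def update(self, state, choice, reward):
        return state


class FailureCounter:
    """Keeps (current arm, consecutive failures) and switches when impatient."""

    def __init__(self, patience=2):
        self.patience = patience

    def initial_state(self):
        return ("left", 0)

    def act(self, state):
        return state[0]

    def update(self, state, choice, reward):
        arm, failures = state
        failures = 0 if reward else failures + 1
        if failures >= self.patience:
            arm = "right" if arm == "left" else "left"
            failures = 0
        return (arm, failures)


print(run_tmaze(AlwaysLeft()), run_tmaze(FailureCounter()))
```

The memoryless policy collects reward on exactly half the trials, while the stateful one loses only `patience` trials per switch. The point of meta-learning is that such a rule is learned by the outer training process rather than written by hand as it is here.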

An interesting example of this approach is Jaderberg et al 2018, which presents a Quake team FPS agent trained using a two-level approach (and Leibo et al 2018 which extends it further with multiple populations). The FPS game is a multiplayer capture-the-flag match where teams compete on a map, rather than the agent controlling a single agent in a death-match setting; learning to coordinate, as well as explicitly communicate, with multiple copies of oneself is tricky and normal training methods don’t work well because updates change all the other copies of oneself as well and destabilize any communication protocols which have been learned. What Jaderberg et al do is use normal deep RL techniques within each agent, predicting and receiving rewards within each game based on earning points for flags/attacks, but then the overall population of 30 agents, after each set of matches, undergoes a second level of selection based on final game score/victory, which then selects on the agent’s internal reward prediction & hyperparameters:

This can be seen as a two-tier reinforcement learning problem. The inner optimisation maximises J_inner, the agents’ expected future discounted internal rewards. The outer optimisation of J_outer can be viewed as a meta-game, in which the meta-reward of winning the match is maximised with respect to internal reward schemes w_p and hyperparameters φ_p, with the inner optimisation providing the meta transition dynamics. We solve the inner optimisation with RL as previously described, and the outer optimisation with Population Based Training (PBT) (29). PBT is an online evolutionary process which adapts internal rewards and hyperparameters and performs model selection by replacing under-performing agents with mutated versions of better agents. This joint optimisation of the agent policy using RL together with the optimisation of the RL procedure itself towards a high-level goal proves to be an effective and generally applicable strategy, and utilises the potential of combining learning and evolution (2) in large scale learning systems.

The goal is to win, the ground-truth reward is the win/loss, but learning only from win/loss is extremely slow: a single bit (probably less) of information must be split over all actions taken by all agents in the game and used to train NNs with millions of interdependent parameters, in a particularly inefficient way as one cannot compute exact gradients from the win/loss back to the responsible neurons. Within-game points are a much richer form of supervision, more numerous and corresponding to short time segments, allowing for much more learning within each game (possibly using exact gradients), but are only indirectly related to the final win/loss; an agent could rack up many points on its own while neglecting to fight the enemy or coordinate well and ensuring a final defeat, or it could learn a greedy team strategy which performs well initially but loses over the long run. So the two-tier problem uses the slow outer signal or loss function (winning) to sculpt the faster inner loss which does the bulk of the learning. (Organisms are adaptation-executors, not fitness-maximizers.) Should the fast inner algorithms not be learning something useful or go haywire or fall for a trap, the outer rewards will eventually recover from the mistake, by mutating or abandoning them in favor of more successful lineages. This combines the crude, slow, dogged optimization of evolution, with the much faster, more clever, but potentially misguided gradient-based optimization, to produce something which will reach the right goal faster. (Two more recent examples would be surrogate/synthetic gradients.)
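The division of labor between the two tiers can be sketched in a few lines. This is a deliberately tiny caricature and not Jaderberg et al's implementation: each agent has a scalar policy x and a scalar internal reward parameter w, the inner reward -(x - w)^2 and the ground-truth score -(x - 3)^2 are invented, and TRUE_TARGET, inner_learning, outer_score, and population_based_training are hypothetical names.

```python
import random

TRUE_TARGET = 3.0  # hypothetical: the behaviour that actually wins matches


def inner_learning(x, w, steps=20, lr=0.2):
    """Fast inner loop: exact gradient ascent on internal reward -(x - w)^2."""
    for _ in range(steps):
        x += lr * (-2.0 * (x - w))
    return x


def outer_score(x):
    """Slow ground truth: one scalar per agent, no gradient to the policy."""
    return -(x - TRUE_TARGET) ** 2


def population_based_training(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    # Each agent is (policy x, internal reward parameter w).
    agents = [(0.0, rng.uniform(-5, 5)) for _ in range(pop_size)]
    for _ in range(generations):
        agents = [(inner_learning(x, w), w) for x, w in agents]
        agents.sort(key=lambda agent: outer_score(agent[0]), reverse=True)
        top = agents[: pop_size // 2]
        # Replace the under-performing half with mutated copies of the top half.
        children = [(x, w + rng.gauss(0, 0.3)) for x, w in top]
        agents = top + children
    return max(agents, key=lambda agent: outer_score(agent[0]))


best_x, best_w = population_based_training()
print(f"policy x = {best_x:.3f}, internal reward target w = {best_w:.3f}")
```

The inner loop never sees the ground truth, and the outer loop never computes a gradient; it only compares final scores. The outer loop is slow (one comparison per generation) but cannot be fooled by a badly chosen w, while the inner loop is fast but only as good as the w it is handed.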


Cosma Shalizi, elsewhere, enjoys noting formal identities between natural selection and Bayesian statistics (especially particle filtering) and markets, where the population frequency of an allele corresponds to a parameter’s prior probability or starting wealth of a trader, and fitness differentials/profits correspond to updates based on new evidence. (See also Evstigneev et al 2008/Lensberg & Schenk-Hoppé 2006, Campbell 2016.) While a parameter may start with an erroneously low prior, at some point the updates will make the posterior converge on it. (The relationship between populations of individuals with noisy fixed beliefs, and Thompson sampling, is also interesting: Krafft 2017.) And stochastic gradient descent can be seen as secretly an approximation or variational form of Bayesian updates by estimating its gradients (because everything that works works because it’s Bayesian?) and of course evolutionary methods can be seen as calculating finite difference approximations to gradients…

Some analogies between different optimization/inference models.

model              | parameter | prior                 | update
-------------------|-----------|-----------------------|----------------------
evolution          | allele    | population frequency  | fitness differential
market             | trader    | starting wealth       | profit
particle filtering | particles | population frequency  | accept-reject sample
SGD                | parameter | random initialization | gradient step
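The formal identity in the first rows of these analogies is that one and the same multiply-and-renormalize step can be read as a Bayesian update, a generation of selection, or a round of trading. A minimal sketch, with made-up numbers and a hypothetical function name reweight:

```python
def reweight(weights, multipliers):
    """One multiplicative update followed by renormalization.

    Read as Bayes: weights = prior, multipliers = likelihoods.
    Read as selection: weights = allele frequencies, multipliers = fitnesses.
    Read as a market: weights = wealth shares, multipliers = gross returns.
    """
    products = [w * m for w, m in zip(weights, multipliers)]
    total = sum(products)
    return [p / total for p in products]


# Hypothetical numbers: the third type starts rare but is 50% fitter.
shares = [0.98, 0.01, 0.01]
fitness = [1.0, 1.0, 1.5]
for generation in range(50):
    shares = reweight(shares, fitness)
print([round(s, 4) for s in shares])
```

The third entry begins with a share of only 1% (an erroneously low prior, in the Bayesian reading) yet ends up with nearly all of the mass, because a constant multiplicative advantage compounds.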

This pattern surfaces in our other examples too. This two-level learning is analogous to meta-learning: the outer or meta-algorithm learns how to generate an inner or object-level algorithm which can learn most effectively, better than the meta-algorithm. It’s also analogous to cells in a human body: overall reproductive fitness is a slow signal that occurs only a few times in a lifetime at most, but over many generations, it builds up fast-reacting developmental and homeostatic processes which can build an efficient and capable body and respond to environmental fluctuations within minutes rather than millennia, and the brain is still superior with split-second situations. It’s also analogous to corporations in a market: the corporation can use whatever internal algorithms it pleases, such as linear optimization or neural networks, and evaluate them internally using internal metrics like number of daily users; but eventually, this must result in profits…

The central problem a corporation solves is how to motivate, organize, punish & reward its sub-units and constituent humans in the absence of direct end-to-end losses without the use of slow external market mechanisms. This is done by tapping into social mechanisms like peer esteem (soldiers don’t fight for their country, they fight for their buddies), selecting workers who are intrinsically motivated to work usefully rather than parasitically, constant attempts to instill a company culture with sloganeering or handbooks or company songs, use of multiple proxy measures for rewards to reduce Goodhart-style reward hacking, ad hoc mechanisms like stock options to try to internalize within workers the market losses, replacing workers with outsourcing or automation, acquiring smaller companies which have not yet decayed internally or as a selection mechanism (acquihires), employing intellectual property or regulation… All of these techniques together can align the parts into something useful to eventually sell…

Man Proposes, God Disposes

…Or else the company will eventually go bankrupt:

Great is Bankruptcy: the great bottomless gulf into which all Falsehoods, public and private, do sink, disappearing; whither, from the first origin of them, they were all doomed. For Nature is true and not a lie. No lie you can speak or act but it will come, after longer or shorter circulation, like a Bill drawn on Nature’s Reality, and be presented there for payment, - with the answer, No effects. Pity only that it often had so long a circulation: that the original forger were so seldom he who bore the final smart of it! Lies, and the burden of evil they bring, are passed on; shifted from back to back, and from rank to rank; and so land ultimately on the dumb lowest rank, who with spade and mattock, with sore heart and empty wallet, daily come in contact with reality, and can pass the cheat no further.

…But with a Fortunatus’ Purse in his pocket, through what length of time might not almost any Falsehood last! Your Society, your Household, practical or spiritual Arrangement, is untrue, unjust, offensive to the eye of God and man. Nevertheless its hearth is warm, its larder well replenished: the innumerable Swiss of Heaven, with a kind of Natural loyalty, gather round it; will prove, by pamphleteering, musketeering, that it is a truth; or if not an unmixed (unearthly, impossible) Truth, then better, a wholesomely attempered one, (as wind is to the shorn lamb), and works well. Changed outlook, however, when purse and larder grow empty! Was your Arrangement so true, so accordant to Nature’s ways, then how, in the name of wonder, has Nature, with her infinite bounty, come to leave it famishing there? To all men, to all women and all children, it is now indubitable that your Arrangement was false. Honour to Bankruptcy; ever righteous on the great scale, though in detail it is so cruel! Under all Falsehoods it works, unweariedly mining. No Falsehood, did it rise heaven-high and cover the world, but Bankruptcy, one day, will sweep it down, and make us free of it.4

A large corporation like Sears may take decades to die (There is a great deal of ruin in a nation, Adam Smith observed), but die it does. Corporations do not increase in performance rapidly and consistently the way selective breeding or AI algorithms do because they cannot replicate themselves as exactly as digital neural networks or biological cells can, but, nevertheless, they are still part of a two-tier process where a ground-truth uncheatable outer loss constrains the internal dynamics to some degree and maintains a baseline or perhaps modest improvement over time. The plan is checked, as Trotsky puts it in criticizing Stalin’s policies like abandoning the NEP, by supply and demand:

If a universal mind existed, of the kind that projected itself into the scientific fancy of Laplace - a mind that could register simultaneously all the processes of nature and society, that could measure the dynamics of their motion, that could forecast the results of their inter-reactions - such a mind, of course, could a priori draw up a faultless and exhaustive economic plan, beginning with the number of acres of wheat down to the last button for a vest. The bureaucracy often imagines that just such a mind is at its disposal; that is why it so easily frees itself from the control of the market and of Soviet democracy. But, in reality, the bureaucracy errs frightfully in its estimate of its spiritual resources.

…The innumerable living participants in the economy, state and private, collective and individual, must serve notice of their needs and of their relative strength not only through the statistical determinations of plan commissions but by the direct pressure of supply and demand. The plan is checked and, to a considerable degree, realized through the market.

Pain Is the Only School-Teacher

Pain is a curious thing. Why do we have painful pain instead of just a more neutral painless pain, when it can backfire so easily as chronic pain, among other problems? Why do we have pain at all instead of regular learning processes or experiencing rewards as we follow plans?

Can we understand pain as another two-level learning process, where a slow but ground-truth outer loss constrains a fast but unreliable inner loss? I would suggest that pain itself is not an outer loss, but the painfulness of pain, its intrusive motivational aspects, is what makes it an outer loss. There is no logical necessity for pain to be painful, but painless pain would not be adaptive or practical because it would too easily let the inner loss lead to damaging behavior.

So let’s consider the possibilities when it comes to pain. There isn’t just pain. There is (at the least): useless painful pain (chronic pain, exercise); useful painful pain (the normal sort); useless nonpainful nonpain (dead nerves in diabetes or leprosy or congenital pain insensitivity5); useful nonpainful nonpain (adrenaline rushes during combat); and useless nonpainful pain (pain asymbolia, where patients maim & kill themselves, possibly also Lesch-Nyhan syndrome); but is there useful painless pain or useless painful nonpain?

A table for clarity:

Utility | Aversiveness | Qualia presence | Examples
--------|--------------|-----------------|---------
useless | painful      | pain            | chronic pain, exercise
useful  | painful      | pain            | normal/injuries
useless | nonpainful   | pain            | pain asymbolia
useful  | nonpainful   | pain            | ?
useless | painful      | nonpain         | ? unconscious processes such as anesthesia awareness? Itches or tickles?
useful  | painful      | nonpain         | cold/heat perception?
useless | nonpainful   | nonpain         | deadened nerves from diseases (diabetes, leprosy), injury, drugs (anesthetics)
useful  | nonpainful   | nonpain         | adrenaline rush/accidents/combat

Pain serves a clear purpose (stopping us from doing things which may cause damage to our bodies), but in an oddly unrelenting way which we cannot disable and which increasingly often backfires on our long-term interests in the form of chronic pain and other problems. Why doesn’t pain operate more like a warning, or like hunger or thirst? They interrupt our minds, but like a computer popup dialogue, after due consideration of our plans and knowledge, we can generally dismiss them. Pain is the interruption which doesn’t go away, although (Morsella 2005):

Theoretically, nervous mechanisms could have evolved to solve the need for this particular kind of interaction otherwise. Apart from automata, which act like humans but have no phenomenal experience, [one can imagine] a conscious nervous system that operates as humans do but does not suffer any internal strife. In such a system, knowledge guiding skeletomotor action would be isomorphic to, and never at odds with, the nature of the phenomenal state - running across the hot desert sand in order to reach water would actually feel good, because performing the action is deemed adaptive. Why our nervous system does not operate with such harmony is perhaps a question that only evolutionary biology can answer. Certainly one can imagine such integration occurring without anything like phenomenal states, but from the present standpoint, this reflects more one’s powers of imagination than what has occurred in the course of evolutionary history.

In the reinforcement learning context, one could ask: does it make a difference whether one has negative or positive rewards? Any reward function with both negative and positive rewards could be turned into all-positive rewards simply by adding a large constant. Is that a difference which makes a difference? Or instead of maximizing positive rewards, one could speak of minimizing losses, and one often does in economics or decision theory or control theory6.
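The claim that adding a large constant changes nothing behaviorally can be checked by brute force in a toy setting. The sketch below assumes a fixed horizon and rewards that depend only on the action taken; the three actions, their reward values, and the function name best_sequence are all invented for illustration.

```python
from itertools import product

# Hypothetical per-step rewards for three actions; all are non-positive.
REWARDS = {"rest": -1.0, "walk": -0.1, "crash": -10.0}


def best_sequence(rewards, horizon, shift=0.0):
    """Brute-force the action sequence maximizing the sum of shifted rewards."""
    return max(
        product(sorted(rewards), repeat=horizon),
        key=lambda seq: sum(rewards[a] + shift for a in seq),
    )


original = best_sequence(REWARDS, horizon=3)
shifted = best_sequence(REWARDS, horizon=3, shift=100.0)
print(original == shifted, original)
```

The equivalence depends on the horizon being fixed: every sequence then gains the same total bonus. If episodes could end early, a constant bonus per step would add more to longer episodes than to shorter ones, and the preferred behavior could change.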

Do Artificial Reinforcement-Learning Agents Matter Morally?, Tomasik 2014, debates the relationship of rewards to considerations of suffering or pain, given the duality between costs-losses/rewards:

Perhaps the more urgent form of refinement than algorithm selection is to replace punishment with rewards within a given algorithm. RL systems vary in whether they use positive, negative, or both types of rewards:

  • In certain RL problems, such as maze-navigation tasks discussed in Sutton and Barto [1998], the rewards are only positive (if the agent reaches a goal) or zero (for non-goal states).
  • Sometimes a mix between positive and negative rewards6 is used. For instance, McCallum [1993] put a simulated mouse in a maze, with a reward of 1 for reaching the goal, -1 for hitting a wall, and -0.1 for any other action.
  • In other situations, the rewards are always negative or zero. For instance, in the cart-pole balancing system of Barto et al. [1990], the agent receives reward of 0 until the pole falls over, at which point the reward is -1. In Koppejan and Whiteson [2011]’s neuroevolutionary RL approach to helicopter control, the RL agent is punished either a little bit, with the negative sum of squared deviations of the helicopter’s positions from its target positions, or a lot if the helicopter crashes.

Just as animal-welfare concerns may motivate incorporation of rewards rather than punishments in training dogs [Hiby et al., 2004] and horses [Warren-Smith and McGreevy, 2007, Innes and McBride, 2008], so too RL-agent welfare can motivate more positive forms of training for artificial learners. Pearce [2007] envisions a future in which agents are driven by gradients of well-being (i.e., positive experiences that are more or less intense) rather than by the distinction between pleasure versus pain. However, it’s not entirely clear where the moral boundary lies between positive versus negative welfare for simple RL systems. We might think that just the sign of the agent’s reward value r would distinguish the cases, but the sign alone may not be enough, as the following section explains.

What’s the boundary between positive and negative welfare?

Consider an RL agent with a fixed life of T time steps. At each time t, the agent receives a non-positive reward $r_t \leq 0$ as a function of the action $a_t$ that it takes, such as in the pole-balancing example. The agent chooses its action sequence $(a_t)_{t=1 \ldots T}$ with the goal of maximising the sum of future rewards:

$$\sum_{t=1}^T r_t(a_t)$$

Now suppose we rewrite the rewards by adding a huge positive constant c to each of them, $r'_t = r_t + c$, big enough that all of the $r'_t$ are positive. The agent now acts so as to optimise

$$\sum_{t=1}^T r'_t(a_t) = \sum_{t=1}^T \left(r_t(a_t) + c\right) = Tc + \sum_{t=1}^T r_t(a_t)$$

So the optimal action sequence is the same in either case, since additive constants don’t matter to the agent’s behaviour. But if behaviour is identical, the only thing that changed was the sign and numerical magnitude of the reward numbers. Yet it seems absurd that the difference between happiness and suffering would depend on whether the numbers used by the algorithm happened to have negative signs in front. After all, in computer binary, negative numbers have no minus sign but are just another sequence of 0s and 1s, and at the level of computer hardware, they look different still. Moreover, if the agent was previously reacting aversively to harmful stimuli, it would continue to do so. As Lenhart K. Schubert explains: [This quotation comes from spring 2014 lecture notes (accessed March 2014) for a course called Machines and Consciousness.]

If the shift in origin [to make negative rewards positive] causes no behavioural change, then the robot (analogously, a person) would still behave as if suffering, yelling for help, etc., when injured or otherwise in trouble, so it seems that the pain would not have been banished after all!

So then what distinguishes pleasure from pain?

…A more plausible account is that the difference relates to avoiding versus seeking. A negative experience is one that the agent tries to get out of and do less of in the future. For instance, injury should be an inherently negative experience, because if repairing injury was rewarding for an agent, the agent would seek to injure itself so as to do repairs more often. If we tried to reward avoidance of injury, the agent would seek dangerous situations so that it could enjoy returning to safety. [This example comes from Lenhart K. Schubert’s spring 2014 lecture notes (accessed March 2014), for a course called Machines and Consciousness. These thought experiments are not purely academic. We can see an example of maladaptive behaviour resulting from an association of pleasure with injury when people become addicted to the endorphin release of self-harm.][7] Injury needs to be something the agent wants to get as far away from as possible. So, for example, even if vomiting due to food poisoning is the best response you can take given your current situation, the experience should be negative in order to dissuade you from eating spoiled foods again. Still, the distinction between avoiding and seeking isn’t always clear. We experience pleasure due to seeking and consuming food but also pain that motivates us to avoid hunger. Seeking one thing is often equivalent to avoiding another. Likewise with the pole-balancing agent: Is it seeking a balanced pole, or avoiding a pole that falls over?

…Where does all of this leave our pole-balancing agent? Does it suffer constantly, or is it enjoying its efforts? Likewise, is an RL agent that aims to accumulate positive rewards having fun, or is it suffering when its reward is suboptimal?
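
The shift-invariance argument in the quoted passage can be checked numerically. The following is only a minimal sketch, not Tomasik’s code: the reward table `reward`, the action set `ACTIONS`, the horizon `HORIZON`, and the constant `SHIFT` are all hypothetical, chosen so that the original rewards are non-positive and the lifetime is fixed, as the argument assumes.

```python
from itertools import product

ACTIONS = (0, 1, 2)   # hypothetical discrete action set
HORIZON = 4           # fixed lifetime T, as the argument assumes
SHIFT = 100.0         # the "huge positive constant" c


def reward(t, a):
    """Hypothetical non-positive reward r_t(a_t): 0 for the 'right' action, negative otherwise."""
    return -abs(a - (t % 3))


def shifted_reward(t, a):
    """r'_t(a_t) = r_t(a_t) + c, which is strictly positive for every (t, a) here."""
    return reward(t, a) + SHIFT


def best_sequence(reward_fn):
    """Brute-force the action sequence maximising the summed reward over the fixed horizon."""
    def total(seq):
        return sum(reward_fn(t, a) for t, a in enumerate(seq))
    best = max(product(ACTIONS, repeat=HORIZON), key=total)
    return best, total(best)


plan, value = best_sequence(reward)
plan_shifted, value_shifted = best_sequence(shifted_reward)

print(plan == plan_shifted)        # the optimal behaviour is unchanged
print(value_shifted - value)       # the totals differ only by T * c
```

Note that the equivalence leans on the fixed horizon: if the agent’s actions could shorten or lengthen the episode, the $Tc$ term would no longer be a constant, and adding $c$ could change which behaviour is optimal.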

So with all that for background, what is the purpose of pain?

The purpose of pain, I would say, is as a ground truth or outer loss. (This is a motivational theory of pain with a more sophisticated RL/psychiatric grounding.)

The pain reward/loss cannot be removed entirely for the reasons demonstrated by the diabetics/lepers/congenital insensitives: the unnoticed injuries and the poor planning are ultimately fatal. Without any pain qualia to make pain feel painful, we will do harmful things like run on a broken leg or jump off a roof to impress our friends, or just move in a not-quite-right fashion and a few years later wind up paraplegics. (An intrinsic curiosity drive alone would interact badly with a total absence of painful pain: after all, what is more novel or harder to predict than the strange and unique states which can be reached by self-injury or recklessness?)

If pain can’t be removed, could pain be turned into a reward, then? Could we be the equivalent of Morsella’s mind that doesn’t experience pain, as it infers plans and then executes them, experiencing only greater or lesser rewards? Such a mind would experience only positive rewards (pleasure) as it runs across burning-hot sands, as this is the optimal action for it to be taking according to whatever grand plan it has thought of.

Perhaps we could… but what stops Morsella’s mind from enjoying rewards by literally running in circles on those sands until it dies or is crippled? Morsella’s mind may make a plan and define a reward function which avoids the need for any pain or negative rewards, but what happens if there is any flaw in the computed plan or the reward estimates? Or if the plan is based on mistaken premises? What if the sands are hotter than expected, or if the distance is much further than expected, or if the final goal (perhaps an oasis of water) is not there? Such a mind raises serious questions about learning and dealing with errors: what does such a mind experience when a plan fails? Does it experience nothing? Does it experience a kind of meta-pain?

What pain provides is a constant, ongoing feedback which anchors all the estimates of future rewards based on planning or bootstrapping. It anchors our intelligence in a concrete estimation of bodily integrity: the intactness of skin, the health of skin cells, the lack of damage to muscles, joints sliding and moving as they ought to, and so on. If we are planning well and acting efficiently in the world, we will, in the long run, on average, experience higher levels of bodily integrity and physical health; if we are learning and choosing and planning poorly, then… we won’t. The badness will gradually catch up with us and we may find ourselves blind scarred paraplegics missing fingers and soon to die. A pain that was not painful would not serve this purpose, as it would merely be another kind of tickling sensation. The perceptions in question are simply more ordinary tactile, kinesthetic, thermoreceptor, or other standard categories of perception; without painful pain, a fire burning your hand simply feels warm (before the thermal-perceptive nerves are destroyed and nothing further is felt), and a knife cutting flesh might feel like a rippling stretching rubbing movement.

We might say that a painful pain is a pain which forcibly inserts itself into the planning/optimization process, as a cost or lack of reward to be optimized. A pain which was not motivating is not what we mean by pain at all.[8] The motivation itself is the qualia of pain, much like an itch is an ordinary sensation coupled with a motivational urge to scratch. Any mental quality or emotion or sensation which is not accompanied by a demandingness, an involuntary taking-into-consideration, is not pain. The rest of our mind can force its way through pain, if it is sufficiently convinced that there is enough reason to incur the costs of pain because the long-term reward is so great, and we do this all the time: we can convince ourselves to go to the gym, or withstand the vaccination needle, or, in the utmost extremity, saw off a trapped hand to save our life. And if we are mistaken, and the predicted rewards do not arrive, eventually the noisy constant feedback of pain will override the decisions leading to pain, and whatever incorrect beliefs or models led to the incorrect decisions will be adjusted to do better in the future.
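
The override dynamic described above can be caricatured in a few lines. This is a toy sketch under loud assumptions, not a model of any real nervous system or RL algorithm from the literature: the function `steps_before_giving_up`, the optimistic `predicted_reward`, the per-step pain value, and the exponential-moving-average update are all invented for illustration, and the feedback is made deterministic rather than noisy to keep the example reproducible.

```python
def steps_before_giving_up(pain_per_step, predicted_reward=5.0,
                           learning_rate=0.1, max_steps=200):
    """Count how long an agent keeps acting on a mistaken plan.

    The agent starts out believing each step is worth `predicted_reward`
    (the plan's promise), but all it actually receives is `pain_per_step`
    (the promised reward never arrives). It keeps acting while its
    estimate stays above 0, the assumed value of simply stopping.
    """
    estimate = predicted_reward
    steps = 0
    while estimate > 0 and steps < max_steps:
        experienced = pain_per_step  # ground-truth feedback for this step
        # Exponential-moving-average update toward what was actually felt.
        estimate += learning_rate * (experienced - estimate)
        steps += 1
    return steps


with_pain = steps_before_giving_up(pain_per_step=-1.0)
without_pain = steps_before_giving_up(pain_per_step=0.0)
print(with_pain, without_pain)
```

In this sketch the agent with painful pain abandons the bad plan after a modest number of steps, while the agent whose pain carries no cost never revises its estimate below zero and runs until the step budget is exhausted.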

But the pain cannot and must not be overridden: human organisms can’t be trusted to simply turn off pain and indulge an idle curiosity about cutting off hands. We are insufficiently intelligent, our priors insufficiently strong, our reasoning and planning too poor, and we must do too much learning within each life to do without pain. Perhaps if we were superintelligent AIs who could trivially plan flawless humanoid locomotion at 1000Hz taking into account all possible damages, or if we were emulated brains sculpted by endless evolutionary procedures to execute perfectly by instinct, or if we were simple amoeba in a Petri dish who had no real choices to make, there would be no need for a pain which was painful. But we are not.

The pain keeps us honest. In the end, pain is our only teacher.

The Perpetual Peace

In war, there is the free possibility that not only individual determinacies, but the sum total of these, will be destroyed as life, whether for the absolute itself or for the people. Thus, war preserves the ethical health of peoples in their indifference to determinate things [Bestimmtheiten]; it prevents the latter from hardening, and the people from becoming habituated to them, just as the movement of the winds preserves the seas from that stagnation which a permanent calm would produce, and which a permanent (or indeed perpetual) peace would produce among peoples.[9]

What if we remove the outer loss?

In a meta-learning context, the learner will then either overfit to a single instance of a problem, or learn a potentially arbitrarily suboptimal average response; in Quake CTF, the inner loss might converge, as mentioned, to every-agent-for-itself or greedy tactical victories guaranteeing strategic losses; in humans, the result would (at present, due to refusal to use artificial selection or genetic engineering) be a gradual buildup of mutation load leading to serious health issues and eventually perhaps a mutational meltdown/error catastrophe; and in an economy, it leads to… the USSR.

The strength of this constraint can vary, depending on the power of the non-ground-truth optimization, the fidelity of replication, and the accuracy of selection. The Price equation gives us quantitative insight into the conditions under which group selection could work at all: if a NN could only copy itself in a crude and lossy way, meta-learning would not work well in the first place (properties must be preserved from one generation to the next); if a human cell copied itself with an error rate of as much as 1 in millions, humans could never exist because reproductive fitness is too weak a reward to purge the escalating mutation load (selective gain is negative); if bankruptcy becomes more arbitrary and has less to do with consumer demand than with acts of god/government, then corporations will become more pathologically inefficient (covariance between traits & fitness too small to accumulate in meaningful ways).
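
For reference, the Price equation in its standard form splits the change in a population’s mean trait into a selection term and a transmission term, $\Delta \bar{z} = \mathrm{Cov}(w_i, z_i)/\bar{w} + \mathrm{E}(w_i \Delta z_i)/\bar{w}$, where $z_i$ is the trait of parent $i$, $w_i$ its fitness, and $\Delta z_i$ the gap between its offspring’s mean trait and its own. The sketch below checks that identity on made-up numbers; the three-member population, the trait and fitness values, and the function name `price_decomposition` are hypothetical and stand in for no real dataset.

```python
def mean(xs):
    return sum(xs) / len(xs)


def price_decomposition(z_parent, w, z_offspring):
    """Return (selection term, transmission term, directly measured change in mean trait).

    z_parent[i]    trait value of parent i
    w[i]           fitness of parent i (number of offspring)
    z_offspring[i] mean trait value among parent i's offspring
    """
    w_bar = mean(w)
    z_bar = mean(z_parent)
    cov_wz = mean([wi * zi for wi, zi in zip(w, z_parent)]) - w_bar * z_bar
    selection = cov_wz / w_bar
    transmission = mean([wi * (zo - zp)
                         for wi, zp, zo in zip(w, z_parent, z_offspring)]) / w_bar
    # Direct measurement: offspring-weighted mean trait minus parental mean trait.
    z_bar_next = sum(wi * zo for wi, zo in zip(w, z_offspring)) / sum(w)
    return selection, transmission, z_bar_next - z_bar


traits = [1.0, 2.0, 3.0]
fitness = [1, 2, 3]                      # fitter parents carry higher trait values

# Faithful copying: offspring inherit the parental trait exactly.
faithful = price_decomposition(traits, fitness, traits)

# Lossy copying: every offspring loses 0.5 of the trait in transmission.
lossy = price_decomposition(traits, fitness, [z - 0.5 for z in traits])

print(faithful)
print(lossy)
```

With faithful copying the transmission term vanishes and the mean trait rises by the selection term alone; with the lossy copying assumed here, the transmission loss outweighs the same selection term and the mean trait falls, which is the regime the paragraph above describes as selection being too weak to purge the accumulating errors.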

As Shalizi concludes in his review:

Planning is certainly possible within limited domains - at least if we can get good data to the planners - and those limits will expand as computing power grows. But planning is only possible within those domains because making money gives firms (or firm-like entities) an objective function which is both unambiguous and blinkered. Planning for the whole economy would, under the most favorable possible assumptions, be intractable for the foreseeable future, and deciding on a plan runs into difficulties we have no idea how to solve. The sort of efficient planned economy dreamed of by the characters in Red Plenty is something we have no clue of how to bring about, even if we were willing to accept dictatorship to do so.

This is why the planning algorithms cannot simply keep growing and take over all markets: who watches the watchmen? As powerful as the various internal organizational and planning algorithms are, and much superior to evolution/market competition, they only optimize surrogate inner losses, which are not the end-goal, and they must be constrained by a ground-truth loss. The reliance on this loss can and should be reduced, but a reduction to zero is undesirable as long as the inner losses converge to any optima different from the ground-truth optima.
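
A toy version of this two-level arrangement, offered as an illustration under stated assumptions rather than as anyone’s actual algorithm: a fast inner loop does gradient descent on a surrogate loss, and a slow outer loop mutates the surrogate’s target and keeps whichever variant scores best on the ground-truth loss. The quadratic losses, the constant `GROUND_TRUTH_TARGET`, and the function names are all hypothetical.

```python
import random

GROUND_TRUTH_TARGET = 3.0   # hypothetical "true" optimum, never seen by the inner loop


def inner_optimize(proxy_target, steps=50, lr=0.1):
    """Fast inner loop: gradient descent on the surrogate loss (x - proxy_target)**2."""
    x = 0.0
    for _ in range(steps):
        x -= lr * 2.0 * (x - proxy_target)
    return x


def ground_truth_loss(x):
    """Slow outer loss: the only place the true target is consulted."""
    return (x - GROUND_TRUTH_TARGET) ** 2


def evolve_proxy(generations=30, children=8, sigma=0.5, seed=0):
    """Outer loop: mutate the proxy target and keep the best-surviving variant."""
    rng = random.Random(seed)
    parent = 0.0  # initial, misguided surrogate objective
    history = [ground_truth_loss(inner_optimize(parent))]
    for _ in range(generations):
        candidates = [parent] + [parent + rng.gauss(0.0, sigma) for _ in range(children)]
        parent = min(candidates, key=lambda p: ground_truth_loss(inner_optimize(p)))
        history.append(ground_truth_loss(inner_optimize(parent)))
    return parent, history


proxy, history = evolve_proxy()
print(history[0], history[-1])
```

Calling `evolve_proxy(generations=0)` corresponds to removing the outer loss: the inner loop still converges cleanly, but to its own misguided target, and the ground-truth loss stays where it started.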

Given the often long lifespan of a failing corporation, the difficulty corporations encounter in aligning employees with their goals, and the inability to reproduce their culture, it is no wonder that group selection in markets is feeble at best, and the outer loss cannot be removed. On the other hand, these failings are not necessarily permanent: as corporations gradually turn into software, which can be copied and exist in much more dynamic markets with faster OODA loops, perhaps we can expect a transition to an era where corporations do replicate precisely & can then start to consistently evolve large increases in efficiency, rapidly exceeding all progress to date.

See also

Appendix: Meta-Learning Paradigms

Metz et al 2018: “Table 1. A comparison of published meta-learning approaches.”

  1. See also SSC & Chris Said’s reviews.

  2. Amusingly, the front of Red Plenty notes a grant from Target to the publisher.

  3. More Simon 1991:

    Over a span of years, a large fraction of all economic activity has been gathered within the walls of large and steadily growing organizations. The green areas observed by our Martian have grown steadily. Ijiri and I have suggested that the growth of organizations may have only a little to do with efficiency (especially since, in most large-scale enterprises, economies and diseconomies of scale are quite small), but may be produced mainly by simple stochastic growth mechanisms (Ijiri and Simon, 1977).

    But if particular coordination mechanisms do not determine exactly where the boundaries between organizations and markets will lie, the existence and effectiveness of large organizations does depend on some adequate set of powerful coordinating mechanisms being available. These means of coordination in organizations, taken in combination with the motivational mechanisms discussed earlier, create possibilities for enhancing productivity and efficiency through the division of labor and specialization.

    In general, as specialization of tasks proceeds, the interdependency of the specialized parts increases. Hence a structure with effective mechanisms for coordination can carry specialization further than a structure lacking these mechanisms. It has sometimes been argued that specialization of work in modern industry proceeded quite independently of the rise of the factory system. This may have been true of the early phases of the industrial revolution, but would be hard to sustain in relation to contemporary factories. With the combination of authority relations, their motivational foundations, a repertory of coordinative mechanisms, and the division of labor, we arrive at the large hierarchical organizations that are so characteristic of modern life.

  4. The French Revolution: A History, by Thomas Carlyle.

  5. See The Hazards of Growing Up Painlessly for a recent example. An example quote from Brand & Yancey’s 1993 Pain: The Gift No One Wants about congenital pain insensitivity & leprosy-induced pain insensitivity:

    When I unwrapped the last bandage, I found grossly infected ulcers on the soles of both feet. Ever so gently I probed the wounds, glancing at Tanya’s face for some reaction. She showed none. The probe pushed easily through soft, necrotic tissue, and I could even see the white gleam of bare bone. Still no reaction from Tanya.

    …her mother told me Tanya’s story…A few minutes later I went into Tanya’s room and found her sitting on the floor of the playpen, fingerpainting red swirls on the white plastic sheet. I didn’t grasp the situation at first, but when I got closer I screamed. It was horrible. The tip of Tanya’s finger was mangled and bleeding, and it was her own blood she was using to make those designs on the sheets. I yelled, Tanya, what happened! She grinned at me, and that’s when I saw the streaks of blood on her teeth. She had bitten off the tip of her finger and was playing in the blood.

    …The toddler laughed at spankings and other physical threats, and indeed seemed immune to all punishment. To get her way she merely had to lift a finger to her teeth and pretend to bite, and her parents capitulated at once. The parents’ horror turned to despair as wounds mysteriously appeared on one of Tanya’s fingers after another…I asked about the foot injuries. They began as soon as she learned to walk, the mother replied. She’d step on a nail or thumbtack and not bother to pull it out. Now I check her feet at the end of every day, and often I discover a new wound or open sore. If she twists an ankle, she doesn’t limp, and so it twists again and again. An orthopedic specialist told me she’s permanently damaged the joint. If we wrap her feet for protection, sometimes in a fit of anger she’ll tear off the bandages. Once she ripped open plaster cast with her bare fingers.

    …Tanya suffered from a rare genetic defect known informally as congenital indifference to pain…Nerves in her hands and feet transmitted messages - she felt a kind of tingling when she burned herself or bit a finger - but these carried no hint of unpleasantness…She rather enjoyed the tingling sensations, especially when they produced such dramatic reactions in others…Tanya, now 11, was living a pathetic existence in an institution. She had lost both legs to amputation: she had refused to wear proper shoes and that, coupled with her failure to limp or shift weight when standing (because she felt no discomfort), had eventually put intolerable pressure on her joints. Tanya had also lost most of her fingers. Her elbows were constantly dislocated. She suffered the effects of chronic sepsis from ulcers on her hands and amputation stumps. Her tongue was lacerated and badly scarred from her nervous habit of chewing it.

    Brand also notes of a leprosy patient whose nerves had been deadened by it:

    As I watched, this man tucked his crutches under his arm and began to run on both feet with a very lopsided gait….He ended up near the head of the line, where he stood panting, leaning on his crutches, wearing a smile of triumph…By running on an already dislocated ankle, he had put far too much force on the end of his leg bone and the skin had broken under the stress…I knelt beside him and found that small stones and twigs had jammed through the end of the bone into the marrow cavity. I had no choice but to amputate the leg below the knee.

    These two scenes have long haunted me.

  6. Bertsekas 2018 helpfully provides a Rosetta stone between optimal control theory & reinforcement learning (see also Powell 2018 & Bertsekas 2019):

    The notation and terminology used in this paper is standard in DP and optimal control, and in an effort to forestall confusion of readers that are accustomed to either the reinforcement learning or the optimal control terminology, we provide a list of selected terms commonly used in reinforcement learning (for example in the popular book by Sutton and Barto [SuB98], and its 2018 on-line 2nd edition), and their optimal control counterparts.

    1. Agent = Controller or decision maker.
    2. Action = Control.
    3. Environment = System.
    4. Reward of a stage = (Opposite of) Cost of a stage.
    5. State value = (Opposite of) Cost of a state.
    6. Value (or state-value) function = (Opposite of) Cost function.
    7. Maximizing the value function = Minimizing the cost function.
    8. Action (or state-action) value = Q-factor of a state-control pair.
    9. Planning = Solving a DP problem with a known mathematical model.
    10. Learning = Solving a DP problem in model-free fashion.
    11. Self-learning (or self-play in the context of games) = Solving a DP problem using policy iteration.
    12. Deep reinforcement learning = Approximate DP using value and/or policy approximation with deep neural networks.
    13. Prediction = Policy evaluation.
    14. Generalized policy iteration = Optimistic policy iteration.
    15. State abstraction = Aggregation.
    16. Episodic task or episode = Finite-step system trajectory.
    17. Continuing task = Infinite-step system trajectory.
    18. Afterstate = Post-decision state.

  7. There are some examples of reward hacking in past RL research which resemble such self-injuring agents - for example, a bicycle agent is rewarded for getting near a target (but not punished for moving away), so it learns to steer toward it in a loop, going around it repeatedly to earn the reward.

  8. Drescher 2004 gives a similar account of motivational pain in Good and Real (pg77-78):

    But a merely mechanical state could not have the property of being intrinsically desirable or undesirable; inherently good or bad sensations, therefore, would be irreconcilable with the idea of a fully mechanical mind. Actually, though, it is your machinery’s very response to a state’s utility designation - the machinery’s very tendency to systematically pursue or avoid the state - that implements and constitutes a valued state’s seemingly inherent deservedness of being pursued or avoided. Roughly speaking, it’s not that you avoid pain (other things being equal) in part because pain is inherently bad; rather, your machinery’s systematic tendency to avoid pain (other things being equal) is what constitutes its being bad. That systematic tendency is what you’re really observing when you contemplate a pain and observe that it is undesirable, that it is something you want to avoid.

    The systematic tendency I refer to includes, crucially, the tendency to plan to achieve positively valued states (and then to carry out the plan), or to plan the avoidance of negatively valued states. In contrast, for example, sneezing is an insistent response to certain stimuli; yet despite the strength of the urge - sneezing can be very hard to suppress - we do not regard the sensation of sneezing as strongly pleasurable (nor the incipient-sneeze tingle, subsequently extinguished by the sneeze, as strongly unpleasant). The difference, I propose, is that nothing in our machinery inclines us to plan our way into situations that make us sneeze (and nothing strongly inclines us to plan the avoidance of an occasional incipient sneeze) for the sake of achieving the sneeze (or avoiding the incipient sneeze); the machinery just isn’t wired up to treat sneezes that way (nor should it be). The sensations we deem pleasurable or painful are those that incline us to plan our way to them or away from them, other things being equal.

  9. On the Scientific Ways of Treating Natural Law, by Hegel 1803.