Why Tool AIs Want to Be Agent AIs

AIs limited to purely computational inferential tasks (Tool AIs) supporting humans will be less intelligent, efficient, and economically valuable than more autonomous reinforcement-learning AIs (Agent AIs) who act on their own and learn to take actions over choice of computation/data/training/architecture/hyperparameters/external-resource use. (decision theory, statistics, NN, computer science, transhumanism, AI)
created: 7 Sep 2016; modified: 14 Aug 2017; status: finished; confidence: likely; importance: 9

Autonomous AI systems (Agent AIs) trained using reinforcement learning can do harm when they take wrong actions, especially superintelligent Agent AIs. One solution would be to eliminate their agency by not giving AIs the ability to take actions, confining them to purely informational or inferential tasks such as classification or prediction (Tool AIs), and have all actions be approved & executed by humans, giving equivalently superintelligent results without the risk. I argue that this is not an effective solution for two major reasons. First, because Agent AIs will by definition be better at actions than Tool AIs, giving an economic advantage. Secondly, because Agent AIs will be better at inference & learning than Tool AIs, and this is inherently due to their greater agency: the same algorithms which learn how to perform actions can be used to select important datapoints to learn inference over, how long to learn, how to more efficiently execute inference, how to design themselves, how to optimize hyperparameters, how to make use of external resources such as long-term memories or external software or large databases or the Internet, and how best to acquire new data. All of these actions will result in Agent AIs more intelligent than Tool AIs, in addition to their greater economic competitiveness. Thus, Tool AIs will be inferior to Agent AIs in both actions and intelligence, implying use of Tool AIs is a even more highly unstable equilibrium than previously argued, as users of Agent AIs will be able to outcompete them on two dimensions (and not just one).


One proposed solution to AI risk is to suggest that AIs could be limited purely to supervised/unsupervised learning, and not given access to any sort of capability that can directly affect the outside world such as robotic arms. In this framework, AIs are treated purely as mathematical functions mapping data to an output such as a classification probability, similar to a logistic or linear model but far more complex; most deep learning neural networks like ImageNet image classification convolutional neural networks (CNN)s would qualify. The gains from AI then come from training the AI and then asking it many questions which humans then review & implement in the real world as desired. So an AI might be trained on a large dataset of chemical structures labeled by whether they turned out to be a useful drug in humans and asked to classify new chemical structures as useful or non-useful; then doctors would run the actual medical trials on the drug candidates and decide whether to use them in patients etc. Or an AI might look like Google Maps/Waze: it answers your questions about how best to drive places better than any human could, but it does not control any traffic lights country-wide to optimize traffic flows nor will it run a self-driving car to get you there. This theoretically avoids any possible runaway of AIs into malignant or uncaring actors who harm humanity by satisfying dangerous utility functions and developing instrumental drives. After all, if they can’t take any actions, how can they do anything that humans do not approve of?

Two variations on this limiting or boxing theme are

  1. Oracle AI: Nick Bostrom, in Superintelligence (2014) (pg145-158) notes that while they can be easily boxed and in some cases like P/NP problems the answers can be cheaply checked or random subsets expensively verified, there are several issues with oracle AIs:

    • the AI’s definition of resources or staying inside the box can change as it learns more about the world (ontological crises)
    • responses might manipulate users into asking easy (and useless problems)
    • making changes in the world can make it easier to answer questions about, by simplifying or controlling it (All processes that are stable we shall predict. All processes that are unstable we shall control.)
    • even a successfully boxed and safe oracle or tool AI can be misused1
  2. Tool AI (the term, as tool mode or tool AGI, was coined by Holden Karnofsky in a July 2011 discussion & elaborated on in a May 2013 essay, but the idea has probably been proposed before). To quote Karnofsky:

    Google Maps - by which I mean the complete software package including the display of the map itself - does not have a utility that it seeks to maximize. (One could fit a utility function to its actions, as to any set of actions, but there is no single parameter to be maximized driving its operations.)

    Google Maps (as I understand it) considers multiple possible routes, gives each a score based on factors such as distance and likely traffic, and then displays the best-scoring route in a way that makes it easily understood by the user. If I don’t like the route, for whatever reason, I can change some parameters and consider a different route. If I like the route, I can print it out or email it to a friend or send it to my phone’s navigation application. Google Maps has no single parameter it is trying to maximize; it has no reason to try to trick me in order to increase its utility. In short, Google Maps is not an agent, taking actions in order to maximize a utility parameter. It is a tool, generating information and then displaying it in a user-friendly manner for me to consider, use and export or discard as I wish.

    Every software application I know of seems to work essentially the same way, including those that involve (specialized) artificial intelligence such as Google Search, Siri, Watson, Rybka, etc. Some can be put into an agent mode (as Watson was on Jeopardy) but all can easily be set up to be used as tools (for example, Watson can simply display its top candidate answers to a question, with the score for each, without speaking any of them.)…Tool-AGI is not trapped and it is not Unfriendly or Friendly; it has no motivations and no driving utility function of any kind, just like Google Maps. It scores different possibilities and displays its conclusions in a transparent and user-friendly manner, as its instructions say to do; it does not have an overarching want, and so, as with the specialized AIs described above, while it may sometimes misinterpret a question (thereby scoring options poorly and ranking the wrong one #1) there is no reason to expect intentional trickery or manipulation when it comes to displaying its results.

    …Another way of putting this is that a tool has an underlying instruction set that conceptually looks like: (1) Calculate which action A would maximize parameter P, based on existing data set D. (2) Summarize this calculation in a user-friendly manner, including what Action A is, what likely intermediate outcomes it would cause, what other actions would result in high values of P, etc. An agent, by contrast, has an underlying instruction set that conceptually looks like: (1) Calculate which action, A, would maximize parameter P, based on existing data set D. (2) Execute Action A. In any AI where (1) is separable (by the programmers) as a distinct step, (2) can be set to the tool version rather than the agent version, and this separability is in fact present with most/all modern software. Note that in the tool version, neither step (1) nor step (2) (nor the combination) constitutes an instruction to maximize a parameter - to describe a program of this kind as wanting something is a category error, and there is no reason to expect its step (2) to be deceptive….This is important because an AGI running in tool mode could be extraordinarily useful but far more safe than an AGI running in agent mode. In fact, if developing Friendly AI is what we seek, a tool-AGI could likely be helpful enough in thinking through this problem as to render any previous work on Friendliness theory moot.

    …Is a tool-AGI possible? I believe that it is, and furthermore that it ought to be our default picture of how AGI will work

    There are similar general issues with Tool AIs as with Oracle AIs:

    • a human checking each result is no guarantee of safety; even Homer nods. A extremely dangerous or subtly dangerous answer might slip through; Stuart Armstrong notes that the summary may simply not mention the important (to humans) downside to a suggestion, or frame it in the most attractive light possible. The more a Tool AI is used, or trusted by users, the less checking will be done of its answers before the user mindlessly implements it.
    • an intelligent, never mind superintelligent Tool AI, will have built-in search processes and planners which may be quite intelligent themselves, and in planning how to plan, discover dangerous instrumental drives and the sub-planning process execute them.2
    • developing a Tool AI in the first place might require another AI, which itself is dangerous

Oracle AIs remain mostly hypothetical because it’s unclear how to write such utility functions. The second approach, Tool AI, is just an extrapolation of current systems but has two major problems aside from the already identified ones which cast doubt on Karnofsky’s claims that Tool AIs would be extraordinarily useful & that we should expect future AGIs to resemble Tool AIs rather than Agent AIs.


First and most commonly pointed out, agent AIs are more economically competitive as they can replace humans in the loop.3 In any sort of process, Amdahl’s law notes that as steps get optimized, the optimization does less and less as the output becomes dominated by the slowest step - if a step only takes 10% of the time or resources, then even infinite optimization of that step down to zero time/resources means that the output will increase by no more than 10%. So if a human overseeing a, say, high-frequency trading (HFT) algorithm, accounts for 50% of the latency in decisions, then the HFT algorithm will never run more than twice as fast as it does now, which is a crippling disadvantage. (Hence, the Knight Capital debacle is not too surprising - no profitable HFT firm could afford to put too many humans into its loops, so when something does go wrong, it can be difficult for humans to figure out the problem & intervene before the losses mount.) As the AI gets better, the gain from replacing the human increases greatly, and may well justify replacing them with an AI inferior in many other respects but superior in some key aspect like cost or speed. This could also apply to error rates - in airline accidents, human error now causes the overwhelming majority of accidents due to their presence as overseers of the autopilots and it’s unclear that a human pilot represents a net safety gain; and in advanced chess, grandmasters initially chose most moves and used the chess AI for checking for tactical errors and blunders, which transitioned through the late 90s and early ’00s to human players (not even grandmasters) turning over most playing to the chess AI but contributing a great deal of win performance by picking & choosing which of several AI-suggested moves to use, but as the chess AIs improved, at some point around 2007 victories increasingly came from the humans making mistakes which the opposing chess AI could exploit, even mistakes as trivial as ’misclicks (on the computer screen), and now in advanced chess, human contribution has decreased to largely preparing the chess AIs’ opening books & looking for novel opening moves which their chess AI can be better prepared for.

At some point, there is not much point to keeping the human in the loop at all since they have little ability to check the AI choices and become deskilled (think drivers following GPS directions), correcting less than they screw up and demonstrating that toolness is no guarantee of safety nor responsible use. (Hence the old joke: the factory of the future will be run by a man and a dog; the dog will be there to keep the man away from the factory controls.) For a successful autonomous program, just keeping up with growth alone makes it difficult to keep humans in the loop; the US drone warfare program has become such a central tool of US warfare that the US Air Force finds it extremely difficult to hire & retain enough human pilots overseeing its drones, and there are indications that operational pressures are slowly eroding the human control & turning them into rubberstamps, and for all its protestations that it would always keep a human in the decision-making loop, the Pentagon is, unsurprisingly, inevitably, sliding towards fully autonomous drone warfare as the next technological step to maintain military superiority over Russia & China. (See Meet The New Mavericks: An Inside Look At America’s Drone Training Program; Future is assured for death-dealing, life-saving drones; Sam Altman’s Manifest Destiny; The Pentagon’s Terminator Conundrum: Robots That Could Kill on Their Own; Attack of the Killer Robots)

Fundamentally, autonomous agent AIs are what we and the free market want; everything else is a surrogate or irrelevant loss function. We don’t want low log-loss error on ImageNet, we want to refind a particular personal photo; we don’t want excellent advice on which stock to buy for a few microseconds, we want a money pump spitting cash at us; we don’t want a drone to tell us where Osama bin Laden was an hour away, we want it to have blown him up when it saw him; we don’t want good advice from Google Maps about what route to drive to our destination, we want to be at our destination without doing any driving etc. Idiosyncratic situations, legal regulation, fears of tail risks from very bad situations, worries about correlated or systematic failures (like hacking a drone fleet), and so on may slow or stop the adoption of Agent AIs - but the pressure will always be there.

So for this reason alone, we expect to see Agent AIs to systematically be preferred over Tool AIs unless they’re considerably worse.


Agent AIs will be chosen over Tool AIs - for reasons aside from not being what anyone wants and something that will be severely penalized by free markets or simply there being multiple agents choosing whether to use a Tool AI or an Agent AI in any kind of competitive scenario - also suffer from the problem that the best Tool AI’s performance/intelligence will be equal to or worse than the best Agent AI, probably worse, and possibly much worse. Bostrom notes that Such creative [dangerous] plans come into view when the [Tool AI] software’s cognitive abilities reach a sufficiently high level.; we might reverse this to say that to make the Tool AI reach a sufficiently high level, we must put such creativity in view. (A linear model may be extremely safe & predictable, but it would be hopeless to expect everyone to use them instead of neural networks.)

An Agent AI clearly benefits from being a better Tool AI, so it can better understand its environment & inputs; but less intuitively, any Tool AI benefits from agentiness. An Agent AI has the potential, often realized in practice, to outperform any Tool AI: it can get better results with less computation, less data, less manual design, less post-processing of its outputs, on harder domains.

(Trivial proof: Agent AIs are supersets of Tool AIs - an Agent AI, by not taking any actions besides communication or random choice, can reduce itself to a Tool AI; so in cases where actions are unhelpful, it performs the same as the Tool AI, and when actions can help, it can perform better; hence, an Agent AI can always match or exceed a Tool AI. At least, assuming sufficient data that in the environments where actions are not helpful, it can learn to stop acting, and in the ones where they are, it has a distant enough horizon to pay for the exploration. Of course, you might agree with this but simply believe that intelligence-wise, Agent AIs == Tool AIs.)

More seriously, not all data is created equal. Not all data points are equally valuable to learn from, require equal amounts of computation, should be treated identically, should inspire identical followup data sampling, or actions. Inference and learning can be much more efficient if the algorithm can choose how to compute on what data with which actions.

There is no hard Cartesian boundary between an algorithm & its environment such that control of the environment is irrelevant to the algorithm and vice-versa and its computation can be carried out without regard to the environment - there are simply many layers between the core of the algorithm and the furthest part of the environment, and the more layers that the algorithm can model & control, the more it can do.4

This is a highly general point which can be applied on many levels. This point often arises in classical statistics/experimental design/decision theory where adaptive techniques can greatly outperform fixed-sample techniques for both inference and actions/losses: an sequential analysis trial testing a hypothesis can often terminate after a fraction of the equivalent fixed-sample trial’s sample size (and/or loss) while exploring multiple questions; an adaptive multi-armed bandit will have much lower regret than any non-adaptive solution, but it will also be inferentially better at estimating which arm is best and what the performance of that arm is (see the best-arm problem: Bubeck et al 2009, Audibert et al 2010, Gabillon et al 2011, Mellor 2014, Jamieson & Nowak 2014, Kaufmann et al 2014), and an adaptive optimal design can constant-factor (gains of 50% or more are possible compared to naive designs like even allocation; McClelland 1997) minimize total variance by focusing on unexpectedly difficult-to-estimate arms (while a fixed-sample trial can be seen as ideal for when one values precise estimates of all arms equally and they have equal variance, which is usually not the case); even a Latin square or blocking or rerandomization design rather than simple randomization can be seen as reflecting this benefit (avoiding the potential for imbalance in allocation across arms by deciding in advance the sequence of actions taken in collecting samples).

The wide variety of uses of action is a major theme in recent work in AI (specifically, deep learning/neural networks) research and increasingly key to achieving the best performance on inferential tasks as well as reinforcement learning/optimization/agent-y tasks. Although these advantages apply to most AI paradigms, because of the power and wide variety of tasks NNs get applied to, and sophisticated architectures, we can see the pervasive advantage of agentiness much more clearly than in narrower contexts like biostatistics.

Actions for intelligence

Roughly, we can try to categorize the different kinds of agentiness by the level of the NN they work on. There are:

  1. actions internal to a computation:

    • inputs
    • intermediate states
    • accessing the external environment
    • amount of computation
    • enforcing constraints/finetuning quality of output
    • changing the loss function applied to output
  2. actions internal to training the NN:

    • the gradient itself
    • size & direction of gradient descent steps on each parameter
    • overall gradient descent learning rate and learning rate schedule
    • choice of data samples to train on
  3. internal to the dataset

    • active learning
    • optimal experiment design
  4. internal to the NN design step

    • hyperparameter optimization
    • NN architecture
  5. internal to interaction with environment

    • adaptive experiment / multi-armed bandit / exploration for reinforcement learning

Actions internal to a computation

Inside a specific NN, while computing the output for an input question, a NN can make choices about how to handle it.

It can choose what parts of the input to run most of its computations on, while throwing away or computing less on other parts of the input, which are less relevant to the output, using attention mechanisms (eg Olah & Carter 2016, Hahn & Keller 2016, Bellver et al 2016, Mansimov et al 2015, Gregor et al 2015, Xu 2015, Larochelle & Hinton 2010, Bahdanau et al 2015, Ranzato 2014, Mnih et al 2014, Sordoni et al 2016, Kaiser & Bengio 2016). Attention mechanisms are responsible for many increases in performance, but especially improvements in RNNs’ ability to do sequence-to-sequence translation by revisiting important parts of the sequence (Vaswani et al 2017), image generation and captioning, and in CNNs’ ability to recognize images by focusing on ambiguous or small parts of the image, even for adversarial examples (Luo et al 2016). They are a major trend in deep learning, as it is often the case that some parts of the input are more important than others.

Many designs can be interpreted as using attention. The bidirectional RNN also often used in natural language translation doesn’t explicitly use attention mechanisms but is believed to help by giving the RNN a second look at the sequence. Indeed, so universal that it often goes without mention is that the LSTM/GRU mechanism which improves almost all RNNs is itself a kind of attention mechanism: the LSTM cells learn which parts of the hidden state/history are important and should be kept, and whether and when the memories should be forgotten and fresh memories loaded into the LSTM cells.

Extending attention, a NN can choose not just which parts of an input to look at multiple times, but also how long to keep computing on it, adaptive computation (Graves 2016a, Figurnov et al 2016, Silver et al 2016b, Zamir et al 2016, Huang et al 2017, Li et al 2017, Wang et al 2017, Teerapittayanon et al 2017, Huang et al 2017): so it iteratively spends more computation on hard parts of problem within a given computational budget5.

Attention generally doesn’t change the nature of the computation aside from the necessity of actions over the input, but actions can be used to bring in different computing paradigms. For example, the entire field of differentiable neural computer/neural Turing machines (Zaremba & Sutskever 2015, Graves et al 2016b) or neural stack machines or neural GPUs or most designs with some sort of scalable external memory mechanism larger than LSTMs (Rae et al 2016) depends on figuring out a clever way to backpropagate through the action of memory accesses or using reinforcement learning techniques like REINFORCE for training the non-differentiable actions. And such an memory is like a database which is constructed on the fly per-problem, so it’ll help with database queries & information retrieval & knowledge graphs (Narasimhan et al 2016, Seo et al 2016, Bachman et al 2016, Buck et al 2017). An intriguing variant on this idea of querying resources is mixture-of-experts (committee machine) NN architectures (Shazeer et al 2016).

Finally, one interesting variant on this theme is treating an inferential or generative problem as a reinforcement learning problem in a sort of environment with global rewards. Many times the standard loss function is inapplicable, or the important things are global, or the task is not really well-defined enough (in a I know it when I see it sense for the human) to nail down as a simple differentiable loss with predefined labels such as in an image classification problem; in these cases, one cannot do standard supervised training to minimize the loss but must start using reinforcement learning to directly optimize a reward - treating outputs such as classification labels as actions with may eventually result in a reward. For example, in a char-RNN generative text model trained by predicting a character conditional on the previous, one can generative reasonable text samples by greedily picking the most likely next character and occasionally a less likely character for diversity, but one can generate much higher quality samples by exploring longer sequences with beam search, and one can improve generation further by adding utility functions for global properties & applying RL algorithms such as Monte Carlo tree search (MCTS) for training or runtime maximization of an overall trait like quality or winning (eg Jaques et al 2016, Norouzi et al 2016, Wu et al 2016, Ranzato et al 2016, Li et al 2016, Silver et al 2016a, Silver et al 2016b, Clark & Manning 2016, Miao & Blunsom 2016, Rennie et al 2016, He et al 2016, Bello et al 2017, Yang et al 2017, Strub et al 2017, Wu et al 2017, Xie et al 2012, Prestwich et al 2017, Paulus et al 2017, Guimaraes et al 2017, Lewis et al 2017, Sakaguchi et al 2017, Supancic III & Ramanan 2017, Pasunuru & Bansai 2017). Most exotically, the loss function can itself be a sort of action/RL setting - consider the close connections (Finn et al 2016, Ho & Ermon 2016, Pfau & Vinyals 2016, Im et al 2016, Goodfellow 2016) between actor-critic reinforcement learning, synthetic gradients (Jaderberg et al 2016), and game-theory-based generative adversarial networks (GANs; Kim et al 2017, Zhu et al 2017).

Actions internal to training

The training of a NN by stochastic gradient descent might seem to be independent of any considerations of actions, but it turns to be another domain where you can go what if we treated this as a MDP? and it’s actually useful. Specifically, gradient descent requires selection of which data to put into a minibatch, how large a change to make to parameters in general based on the error in the current minibatch (the learning rate hyperparameter), or how much to update each individual parameter each minibatch (perhaps having some neurons which get tweaked much less than others). Actions are things like selecting 1 out of n possible minibatches to do gradient descent on, or selecting 1 out of n possible learning rates with the learning rate increasing/decreasing over time (Andrychowicz et al 2016, Fu et al 2016, Xu et al 2016, Jaderberg et al 2016, Wichrowska et al 2017, Hamrick et al 2017, Xu et al 2017; prioritized traces, prioritized experience replay, boosting, hard-negative mining, importance sampling (Katharopoulos & Fleuret 2017), prioritizing hard samples, Fan et al 2016, Salehi et al 2017).

Actions internal to data selection

We have previously looked at sampling from existing datasets: training on hard samples, and so on. One problem with existing datasets is that they can be inefficient - perhaps they have class imbalance problems where some kinds of data are overrepresented and what is really needed for improved performance is more of the other kinds of data. An image classification CNN doesn’t need 99 dog photos & 1 cat photos, it wants 50 dog photos & 50 cat photos. (Quite aside from the fact that there’s not enough information to classify other cat photos based on just 1 exemplar, the CNN will simply learn to always classify photos as dog.) One can try to fix this by choosing predominately from the minority classes, or by changing the loss function to make classifying the minority class correctly much more valuable than classifying the majority class.

Even better is if the NN can somehow ask for new data or be given additional/corrected data when it makes a mistake. This leads us to active learning: given possible additional datapoints (such as a large pool of unlabeled datapoints), the NN can ask for the datapoint which it will learn the most from (Houlsby et al 2011, Islam 2016, Gal 2016, Ling & Fidler 2017, Christiano et al 2017, Sener & Savarese 2017). One could, for example, train a RL agent to query a search engine and select the most useful images/videos for learning a classification task (eg YouTube: Yeung et al 2017). We can think of it as a little analogous to how kids6 ask parents not random questions, but ones they’re most unsure about, with the most implications one way or another. Settles 2010 discusses the practical advantages to machine learning algorithms of careful choice of data points to learn from or label, and gives some of the known theoretical results on how large the benefits can be - on a toy problem, an error rate e decreasing in sample count from 𝒪(1ϵ)\mathcal{O}(\frac{1}{\epsilon}) to 𝒪(log(1ϵ))\mathcal{O}(log(\frac{1}{\epsilon})), or in a Bayesian setting, a decrease of 𝒪(dϵ)\mathcal{O}(\frac{d}{\epsilon}) to 𝒪(dlog(1ϵ))\mathcal{O}(d \cdot log(\frac{1}{\epsilon})). Active learning also connects back, from a machine learning perspective, to some of the statistical areas covering the benefits of adaptive/sequential trials - optimal experiments query the most uncertain aspects, which the most can be learned from.

Actions internal to NN design

Moving on to more familiar territory, we have hyperparameter optimization using random search or grid search or Bayesian Gaussian processes to try training a possible NN, observe interim (Swersky et al 2014) and final performance, and look for better hyperparameters. But if hyperparameters are parameters we don’t know how to learn yet, then we can see the rest of neural network architecture design as being hyperparameters too: what is the principled difference between setting a dropout rate and setting the number of NN layers? Or between setting a learning rate schedule and the width of NN layers or the number of convolutions or what kind of pooling operators are used? There is none; they are all hyperparameters, just that usually we feel it is too difficult for hyperparameter optimization algorithms to handle many options and we limit them to a small set of key hyperparameters and use grad student descent to handle the rest of the design. So… what if we used powerful algorithms (viz. neural networks) to design neural networks & units like LSTMs (Zoph & Le 2016, Baker et al 2016, Chen et al 2016, Duan et al 2016, Wang et al 2016, Castronovo 2016, Ha et al 2016, Fernando et al 2017, Ravi & Larochelle 2017, Negrinho & Gordon 2017, Miikkulainen et al 2017, Real et al 2017, Hu et al 2017, Johnson et al 2017, Veniat & Denoyer 2017, Munkhdalai & Yu 2017, Cai et al 2017, Zoph et al 2017)?

The logical extension of these neural networks all the way down papers is that an actor like Google/Baidu/Facebook/MS could effectively turn NNs into a black box: a user/developer uploads through an API a dataset of input/output pairs of a specified type and a monetary loss function, and a top-level NN running on a large GPU cluster starts autonomously optimizing over architectures & hyperparameters for the NN design which balances GPU cost and the monetary loss, interleaved with further optimization over the thousands of previous submitted tasks, sharing its learning across all of the datasets/loss functions/architectures/hyperparameters, and the original user simply submits future data through the API for processing by the best NN so far. (Google and Facebook have already taken steps toward this using distributed hyperparameter optimization services which benefit from transfer learning across tasks; Google Vizier/HyperTune, FBLearner Flow.)

Actions external to the agent

Finally, we come to actions in environments which aren’t purely virtual. Adaptive experiments, multi-armed bandits, reinforcement learning etc will outperform any purely supervised learning. For example, AlphaGo trained as a pure supervised-learning Tool AI, predicting next moves of human Go games in a KGS dataset, but that was only a prelude to the self-play, which boosted it from professional player to superhuman level; aside from replacing loss functions (a classification loss like log loss vs victory), the AlphaGo NNs were able to explore tactics and positions that never appeared in the original human dataset. The rewards can also help turn an unsupervised problem (what is the structure or label of each frame of a video game?) into more of a semi-supervised problem by providing some sort of meaningful summary: the reward. A DQN Atari Learning Environment (ALE) agent will, without any explicit image classification, learn to recognize & predict objects in a game which are relevant to achieving a high score.


So to put it concretely: CNNs with adaptive computations will be computationally faster for a given accuracy rate than fixed-iteration CNNs, CNNs with attention classify better than CNNs without attention, CNNs with focus over their entire dataset will learn better than CNNs which only get fed random images, CNNs which can ask for specific kinds of images do better than those querying their dataset, CNNs which can trawl through Google Images and locate the most informative one will do better still, CNNs which access rewards from their user about whether the result was useful will deliver more relevant results, CNNs whose hyperparameters are automatically optimized by an RL algorithm (and possibly trained directly by a NN) will perform better than CNNs with handwritten hyperparameters, CNNs whose architecture as well as standard hyperparameters are designed by RL agents will perform better than handwritten CNNs… and so on. (It’s actions all the way down.)

The drawback to all this is the implementation difficulty is higher, the sample efficiency can be better or worse (individual parts will have greater sample-efficiency but data will be used up training the additional flexibility of other parts), and the computation requirements for training can be much higher; but the asymptotic performance is better, and the gap probably grows as GPUs & datasets get bigger and tasks get more difficult & valuable in the real world.

Why You Shouldn’t Be A Tool

Why does treating all these levels as decision or reinforcement learning problems help so much?

One answer is that most points are not near any decision boundary, or are highly predictable and contribute little information. Optimizing explorations can often lead to prediction/classification/inference gains. These points need not be computed extensively, nor trained on much, nor collected further. If a particular combination of variables is already being predicted with high accuracy (perhaps because it’s common), adding even an infinite number of additional samples will do little; one sample from an unsampled region far away from the previous samples may be dramatically informative. A model trained on purely supervised data collected from humans or experts may have huge gaping holes in its understanding, because most of its data will be collected from routine use and will not sample many regions of state-space, leading to well-known brittleness and bizarre extrapolations, caused by precisely the fact that the humans/experts avoid the dumbest & most catastrophic mistakes and those situations are not represented in the dataset at all! (Thus, a Tool AI might be safe in the sense that it is not an agent, but very unsafe because it is dumb as soon as it goes outside of routine use.) Such flaws in the discriminative model would be exposed quickly in any kind of real world or competitive setting or by RL training.7 You need the right data, not more data. (39. Re graphics: A picture is worth 10K words - but only those to describe the picture. Hardly any sets of 10K words can be adequately described with pictures.)

Another answer is the curse of dimensionality: in many environments, the tree of possible actions and subsequent rewards grows exponentially, so any sequence of actions over more than a few timesteps is increasingly unlikely to ever be sampled, and sparse rewards will be increasingly likely to be observed. Even if an important trajectory is executed at random and a reward obtained, it will be equally unlikely to ever be executed again - whereas some sort of RL agent, whose beliefs affect its choice of actions, can sample the important trajectory repeatedly, and rapidly converge on an estimate of its high value and continue exploring more deeply.

A dataset of randomly generated sequences of robot arm movements intended to grip an object would likely include no rewards (successful grips) at all, because it requires a long sequence of finely calibrated arm movements; with no successes, how could the tool AI learn to manipulate an arm? It must be able to make progress by testing its best arm movement sequence candidate, then learn from that and test the better arm movement, and so on, until it succeeds. Without any rewards or ability to hone in good actions, only the initial states will be observed and progress will be extremely slow compared to an agent who can take actions and explore novel parts of the environment (eg the problem of Montezuma’s Revenge in the Atari Learning Environment: because of reward sparsity, an epsilon-greedy might as well not be an agent compared to some better method of exploring like density-estimation in Bellemare et al 2016.)

Or imagine training a Go program by creating a large dataset of randomly generated Go boards, then evaluating each possible move’s value by playing out a game between random agents from it; this would not work nearly as well as training on actual human-generated board positions which target the vanishingly small set of high-quality games & moves. The exploration homes in on the exponentially shrinking optimal area of the movement tree based on its current knowledge, discarding the enormous space of bad possible moves. In contrast, a tool AI cannot lift itself up by its bootstraps. It merely gives its best guess on the static current dataset, and that’s that. If you don’t like the results, you can gather more data, but it probably won’t help that much because you’ll give it more of what it already has.

Hence, being an agent is much better than being a tool.

See also

  1. Superintelligence, pg148:

    Even if the oracle itself works exactly as intended, there is a risk that it would be misused. One obvious dimension of this problem is that an oracle AI would be a source of immense power which could give a decisive strategic advantage to its operator. This power might be illegitimate and it might not be used for the common good. Another more subtle but no less important dimension is that the use of an oracle could be extremely dangerous for the operator herself. Similar worries (which involve philosophical as well as technical issues) arise also for other hypothetical castes of superintelligence. We will explore them more thoroughly in Chapter 13. Suffice it here to note that the protocol determining which questions are asked, in which sequence, and how the answers are reported and disseminated could be of great significance. One might also consider whether to try to build the oracle in such a way that it would refuse to answer any question in cases where it predicts that its answering would have consequences classified as catastrophic according to some rough-and-ready criteria.

  2. Superintelligence, pg152-153, pg158:

    With advances in artificial intelligence, it would become possible for the programmer to offload more of the cognitive labor required to figure out how to accomplish a given task. In an extreme case, the programmer would simply specify a formal criterion of what counts as success and leave it to the AI to find a solution. To guide its search, the AI would use a set of powerful heuristics and other methods to discover structure in the space of possible solutions. It would keep searching until it found a solution that satisfied the success criterion…Rudimentary forms of this approach are quite widely deployed today…A second place where trouble could arise is in the course of the software’s operation. If the methods that the software uses to search for a solution are sufficiently sophisticated, they may include provisions for managing the search process itself in an intelligent manner. In this case, the machine running the software may begin to seem less like a mere tool and more like an agent. Thus, the software may start by developing a plan for how to go about its search for a solution. The plan may specify which areas to explore first and with what methods, what data to gather, and how to make best use of available computational resources. In searching for a plan that satisfies the software’s internal criterion (such as yielding a sufficiently high probability of finding a solution satisfying the user-specified criterion within the allotted time), the software may stumble on an unorthodox idea. For instance, it might generate a plan that begins with the acquisition of additional computational resources and the elimination of potential interrupters (such as human beings). Such creative plans come into view when the software’s cognitive abilities reach a sufficiently high level. When the software puts such a plan into action, an existential catastrophe may ensue….The apparent safety of a tool-AI, meanwhile, may be illusory. In order for tools to be versatile enough to substitute for superintelligent agents, they may need to deploy extremely powerful internal search and planning processes. Agent-like behaviors may arise from such processes as an unplanned consequence. In that case, it would be better to design the system to be an agent in the first place, so that the programmers can more easily see what criteria will end up determining the system’s output.

  3. Superintelligence, pg151:

    It might be thought that by expanding the range of tasks done by ordinary software, one could eliminate the need for artificial general intelligence. But the range and diversity of tasks that a general intelligence could profitably perform in a modern economy is enormous. It would be infeasible to create special-purpose software to handle all of those tasks. Even if it could be done, such a project would take a long time to carry out. Before it could be completed, the nature of some of the tasks would have changed, and new tasks would have become relevant. There would be great advantage to having software that can learn on its own to do new tasks, and indeed to discover new tasks in need of doing. But this would require that the software be able to learn, reason, and plan, and to do so in a powerful and robustly cross-domain manner. In other words, it would need general intelligence. Especially relevant for our purposes is the task of software development itself. There would be enormous practical advantages to being able to automate this. Yet the capacity for rapid self-improvement is just the critical property that enables a seed AI to set off an intelligence explosion.

  4. Indeed, while Google Maps was used as a paradigmatic example of a Tool AI, it’s not clear how hard this can be pushed: Google Maps/Waze is, of course, trying to maximize something - traffic & ad revenue. Google Maps, like any Google property, is doubtless constantly running A/B tests on its users to optimize for maximum usage, its users are constantly feeding in data about routes & traffic conditions to Google Maps/Waze through the website interface & smartphone GPS/WiFi geographic logs, and to the extent that users make any use of the information & increase/decrease their use of Google Maps which many do so blindly, Google Maps will get feedback after changing the real world (sometimes to the intense frustration of those affected, who try to manipulate it back)… Is Google Maps/Waze a Tool AI or a very large-scale Agent AI?

    It is in a POMDP environment, it has a clear reward function in terms of website traffic, and it has a wide set of actions it continuously explores with epsilon-greedy randomization; even though it was designed to be a Tool AI, from an abstract perspective, one would have to consider it to have evolved into a weak Agent AI due to its commercial context and use in real-world actions. We might consider Google Maps to be a secret agent: it is not a Tool AI but an Agent AI with a hidden & highly opaque reward function.

  5. If the NN is trained to minimize error alone, it’ll simply spend as much time as possible on every problem; so a cost is imposed on each iteration to encourage it to finish as soon as it has a good answer, and learn to finish sooner. And how do we decide what costs to impose on the NN for deciding whether to loop another time or emit its current best guess as good enough? Well, that’ll depend on the cost of GPUs and the economic activity and the utility of results for the humans…

  6. Kyunghyun Cho, 2015:

    One question I remember came from Tieleman. He asked the panelists about their opinions on active learning/exploration as an option for efficient unsupervised learning. Schmidhuber and Murphy responded, and before I reveal their response, I really liked it. In short (or as much as I’m certain about my memory,) active exploration will happen naturally as the consequence of rewarding better explanation of the world. Knowledge of the surrounding world and its accumulation should be rewarded, and to maximize this reward, an agent or an algorithm will active explore the surrounding area (even without supervision.) According to Murphy, this may reflect how babies learn so quickly without much supervising signal or even without much unsupervised signal (their way of active exploration compensates the lack of unsupervised examples by allowing a baby to collect high quality unsupervised examples.)

  7. An example here might be the use of ladders or mirroring in Go - models trained in a purely supervised fashion on a dataset of Go games can have serious difficulty responding to a ladder or mirror because those strategies are so bad that no human would play them in the dataset. Once the Tool AI has been forced off-policy, its predictions & inferences may become garbage because it’s never seen anything like those states before; an agent will be better off because it’ll have been forced into them by exploration or adversarial training and have learned the proper responses.