Annotated Bibliography of Recommended Materials

The (1) to (5) ratings below indicate each entry's priority, with (1) being the highest.

Contents

The following categories represent clusters of ideas that we believe warrant further attention and research for the purpose of developing human-compatible AI (HCAI), as well as prerequisite materials for approaching them. The categories are not meant to be mutually exclusive or hierarchical. Indeed, many subproblems of HCAI development may end up relating in a chicken-and-egg fashion.
Theoretical Background - Prerequisites from fields like probability theory, logic, theoretical CS, and game theory.
Introduction to AI/ML - Pedagogically helpful AI background.
Prevailing Methods in AI/ML - Some important state-of-the-art methods to know about.
Cybersecurity and AI - Valuable perspectives on adversarial digital behavior.
Broad Perspectives on HCAI - Overviews of challenges and approaches to HCAI.
Corrigibility - Designing AI systems that do not resist being redesigned or switched off.
Foundational Work - Developing the theoretical underpinnings of AI in ways that enable HCAI research.
Interactive AI - Productively incorporating human oversight with AI operations.
Preference Inference - Building AI systems that learn what humans want.
Reward Engineering - Designing objectives for AIs that are sure to result in desirable behavior.
Robustness - Building systems that perform well in novel environments.
Transparency - Enabling human understanding of AI systems.
Cognitive Science - Models of human psychology relevant to HCAI.
Moral Theory - Moral issues that HCAI systems and their developers might encounter.

Background

Theoretical Background

Certain basic mathematical and statistical principles have contributed greatly to the development of today's prevailing methods in AI. This category surveys areas of knowledge such as probability theory, logic, theoretical CS, game theory, and economics that are likely to be helpful in understanding existing AI research, and advancing HCAI research more specifically.
Course Materials:
(1) Machine Learning (lecture notes), Andrew Ng. Course notes introducing various prevailing ML methods, including SVMs, the Perceptron algorithm, k-means clustering, Gaussian mixtures, the EM Algorithm, factor analysis, PCA, ICA, and RL.
(2) Mathematics of Machine Learning, Philippe Rigollet (2015).
(2) Algorithmic Game Theory (Fall 2013), Tim Roughgarden. Covers the interface of theoretical CS and econ: mechanism design for auctions, algorithms and complexity theory for learning and computing Nash and market equilibria, with case studies.
Textbooks:
(2) Game Theory: Analysis of Conflict, Roger Myerson (1997). Harvard University Press. Well-written text that motivates and develops the theory of bargaining, communication, and cooperation in game theory, with rigorous decision-theoretic foundations and formal game representations.
(2) Mathematical Logic : A course with exercises -- Part I, Cori and Lascar (2001). Oxford University Press. Covers propositional calculus, boolean algebras, predicate calculus, and major completeness theorems demonstrating the adequacy of various proof methods.
(2) Information Theory, Inference, and Learning Algorithms Parts I-III, MacKay (2003). Cambridge University Press.
(2) Causality: models, reasoning, and inference, Judea Pearl (2000). Cambridge University Press. A precise methodology for AI systems to learn causal models of the world and ask counterfactual questions such as "What will happen if I move my arm here?"
Mathematical Logic : A course with exercises -- Part II, Cori and Lascar (2001). Oxford University Press. Covers recursion theory, the formalization of arithmetic, and Godel's theorems, illustrating how statements about algorithms can be expressed and proven in the language of arithmetic.
Computability and Logic, Boolos and Burgess (2007). Cambridge University Press.
All of Statistics: A Concise Course in Statistical Inference, Wasserman (2004). Springer.
An Introduction to Game Theory, Osborne (2003). Oxford University Press.
Bayesian Data Analysis, Gelman and Rubin (2013). CRC Press.
, Ullman and Hopcroft (1990). Cambridge University Press.
Introduction to the Theory of Computation, Michael Sipser (2012). Cengage Learning.
, Enderton (2000). Academic Press.
, Monin (2003). Springer.
Videos:
(1) Machine Learning (online course), Andrew Ng. Video lectures introducing linear regression, Octave/MatLab, logistic regression, neural networks, SVMs, and surveying methods for recommender systems, anomaly detection, and "big data" applications.
(2) Game Theory I (Coursera), Jackson, Leyton-Brown, Shoham (2016).
Game Theory II: Advanced Applications (Coursera), Jackson, Leyton-Brown, Shoham (2016).
Published Articles:
(2) Algorithmic Statistics, Peter Gacs, John T. Tromp, Paul M.B. Vitanyi (2001). IEEE. Nice overview of some connections between Kolmogorov complexity and information theory.
A Comprehensive Survey of Multiagent Reinforcement Learning, Busoniu, Babuska, De Schutter (2008). IEEE.

Introduction to AI/ML

This category surveys pedagogically basic material for understanding today's prevailing methods in AI and machine learning. Some of the methods here are no longer the state of the art, and are included to aid in understanding more advanced current methods.
Course Materials:
(1) Machine Learning, Prof. Tommi Jaakkola (2006). Perceptrons, SVMs, regularization, regression methods, bias and variance, active learning, model description length, feature selection, boosting, EM, spectral clustering, graphical models.
(2) Deep Reinforcement Learning, OpenAI (2015).
Intro to Artificial Intelligence (Udacity), Norvig and Thrun.
Textbooks:
(1) Deep Learning, Chapters 6-12, Goodfellow, Bengio, and Courville (2016). MIT Press. Covers deep networks, deep feedforward networks, regularization for deep learning, optimization for training deep models, convolutional networks, and recurrent and recursive nets.
(1) Deep Learning, Chapters 1-5, Goodfellow, Bengio, and Courville (2016). MIT Press. Reviews linear algebra, probability and information theory, and numerical computation as part of a course on deep learning.
(1) Reinforcement Learning, Sutton and Barto (1998). MIT Press. Covers multi-armed bandits, finite MDPs, dynamic programming, Monte Carlo methods, temporal-difference learning, bootstrapping, tabular methods, on-policy and off-policy methods, eligibility traces, and policy gradients.
(2) Machine Learning: A Probabilistic Perspective, Murphy (2012). MIT Press. A modern textbook on machine learning.
(2) Pattern Recognition and Machine Learning, Bishop (2006). Springer.
(2) Neural Networks and Deep Learning, Chapters 1-6 (more entry-level), Michael Nielsen (2016). Michael Nielsen.
(2) Artificial Intelligence: A Modern Approach, Chapters 1-17, Stuart Russell and Peter Norvig (2010). Pearson. Used in over 1300 universities in over 110 countries. A comprehensive overview, covering problem-solving, uncertain knowledge and reasoning, learning, communicating, perceiving, and acting.
Probabilistic Graphical Models: Principles and Techniques, Daphne Koller and Nir Friedman (2009). MIT Press. Comprehensive book on graphical models, which represent probability distributions in a way that is both principled and potentially transparent.
The Quest for Artificial Intelligence: A History of Ideas and Achievements, Nils Nilsson (2010). Stanford. A book-length history of the field of artificial intelligence, up until 2010 or so.
Videos:
(1) Reinforcement Learning, Silver (2015). Lecture course on reinforcement learning taught by David Silver.
(1) Introduction to AI, Pieter Abbeel, Dan Klein (2012). Berkeley. Informed, uninformed, and adversarial search; constraint satisfaction; expectimax; MDPs and RL; graphical models; decision diagrams; naive Bayes, perceptrons, clustering; NLP; game-playing; robotics.
(2) Deep Learning, Nando de Freitas (2015).
(2) Machine Learning, Nando de Freitas (2013).
(2) Deep Reinforcement Learning, Schulman (2016).
Published Articles:
(2) A Few Useful Things to Know about Machine Learning, Pedro Domingos (2012). UW. "[D]eveloping successful machine learning applications requires a substantial amount of "black art" that is hard to find in textbooks. This article summarizes twelve key lessons."
Probabilistic machine learning and artificial intelligence, Zoubin Ghahramani (2015). Nature. Reviews the state of the art in Bayesian machine learning, including probabilistic programming, Bayesian optimization, data compression, and automatic model discovery.
Deep learning in neural networks: An overview, Schmidhuber (2015). Neural Networks.

Prevailing Methods in AI/ML

Many currently known methods from machine learning and other disciplines already illustrate surprising results that should inform what we think is possible for future AI systems. Thus, while we do not yet know for certain what architecture future AI systems will use, it is still important for researchers working on HCAI to have a solid understanding of the current state of the art in AI methods. This category lists background reading that we hope will be beneficial for that purpose.
Videos:
, Hugo Larochelle (2015).
Published Articles:
(1) Imagenet classification with deep convolutional neural networks, Alex Krizhevsky, Ilya Sutskever, Geoff Hinton (2012). NIPS. The seminal paper in deep learning.
(2) Human-level control through deep reinforcement learning, Mnih et al. (2015). Nature.
(2) Mastering the game of Go with deep neural networks and tree search, Silver et al (2016). Nature. Describes DeepMind's superhuman Go-playing AI. Ties together several techniques in reinforcement learning and supervised learning.
(2) Representation Learning: A Review and New Perspectives, Bengio, Courville, Vincent (2013). TPAMI.
Deep learning, LeCun, Bengio, Hinton (2015). Nature.
Human-level concept learning through probabilistic program induction, Brenden M. Lake, Ruslan Salakhutdinov, Joshua B. Tenenbaum (2015). Science. State-of-the-art example of hierarchical structured probabilistic models, probably the main alternative to deep neural nets.

Cybersecurity and AI

Understanding the nature of adversarial behavior in a digital environment is important to modeling how powerful AI systems might attack or defend existing systems, as well as how status quo cyber infrastructure might leave future AI systems vulnerable to attack and manipulation. A cybersecurity perspective is helpful to understanding these dynamics.

(Entries for this category are forthcoming.)


Broad Perspectives on HCAI (Human-Compatible AI)

There are many problems that need solving to ensure that future, highly advanced AI systems will be safe to use, including mathematical, engineering, and social challenges. We must sometimes take a bird's-eye view on these problems, for example, to notice that technical safety research will not be useful on its own unless the institutions that build AI have the resources and incentives to use that research. This category recommends reading that we think can help develop such a perspective.
Published Articles:
(1) Artificial Intelligence as a Positive and Negative Factor in Global Risk, Eliezer Yudkowsky (2008). Global Catastrophic Risks, pp. 308-345, New York: Oxford University Press. Examines ways in which AI could be used to reduce large-scale risks to society, or could be a source of risk in itself.
(2) How We're Predicting AI --- or Failing To, Stuart Armstrong and Kaj Sotala (2012). Beyond AI: Artificial Dreams.
(2) The Value Learning Problem, Nate Soares (2016). IJCAI.
(2) The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Intelligent Agents, Nick Bostrom (2012). nickbostrom.com. Argues that highly intelligent agents can have arbitrary end-goals (e.g. paperclip manufacturing), but are likely to pursue certain common subgoals such as resource acquisition.
(2) Research Priorities for Robust and Beneficial Artificial Intelligence, Stuart Russell, Daniel Dewey, Max Tegmark (2015). AI Magazine. Lays out high-level problems in HCAI, including problems in economic impacts of AI, legal and ethical implications of AI, and CS research for AI validation, verification, security, and control.
(2) The Basic AI Drives, Steve Omohundro (2008). AAAI. Informally argues that sufficiently advanced goal-oriented AI systems will pursue certain "drives" such as acquiring resources or increased intelligence that arise as subgoals of many end-goals.
(2) A Model of Pathways to Artificial Superintelligence Catastrophe for Risk and Decision Analysis, Anthony M. Barrett and Seth D. Baum (2016). JETAI. Describes models of superintelligence catastrophe risk as trees representing boolean formulae, and gives recommendations for reducing such risk.
(2) Building Machines That Learn and Think Like People, Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman (2016). MIT. Outlines what current systems based on deep neural networks are still missing in terms of what they can learn and how they learn it (compared to humans), and describes some routes towards these goals.
Concept Learning for Safe Autonomous AI, Kaj Sotala (2015). AAAI. Argues that relatively few concepts may be needed for an AI to learn human values, based on the observation that this may be how humans do it.
A Roadmap towards Machine Intelligence, Tomas Mikolov, Armand Joulin, Marco Baroni (2015). arXiv. How some research scientists at Facebook expect the incremental development of intelligent machines to occur.
Thinking Inside the Box: Controlling and Using an Oracle AI, Stuart Armstrong, Anders Sandberg, Nick Bostrom (2012). Minds and Machines. Describes some challenges in the design of a boxed 'Oracle AI' system, that has strictly controlled input and merely answers questions.
, Stuart Armstrong (2013). Springer. Describing Oracle AI and methods of controlling it.
The Singularity: A Philosophical Analysis, David Chalmers (2010). Journal of Consciousness Studies. A philosophical discussion of the possibility of an intelligence explosion, how we could ensure good outcomes, and issues of personal identity.
The Evolved Radio and its Implications for Modelling the Evolution of Novel Sensors, Bird et al (2002). CEC. An example of evolutionarily-driven learning approaches leading to very unexpected results that aren't what the designer hoped to produce.
The AGI Containment Problem, James Babcock, Janos Kramar, and Roman Yampolskiy (2016). AGI.
Unpublished Articles:
(2) Towards Verified Artificial Intelligence, Sanjit A. Seshia, Dorsa Sadigh, and S. Shankar Sastry (2016). arXiv.
(2) A survey of research questions for robust and beneficial AI, Future of Life Institute (2016). FLI. Broad discussion of potential research directions related to AI Safety, including short- and long-term technical work, forecasting, and policy.
, Roman V. Yampolskiy (2015). arXiv.
, Nick Bostrom (2016). nickbostrom.com.
, Lukas Gloor (2016). FRI. Discusses extreme downside risks of highly capable AI, such as dystopian futures.
News and Magazine Articles:
(1) Can Digital Computers Think?, Alan Turing (1951). BBC Third Programme. "It is customary [...] to offer a grain of comfort [...] that some particularly human characteristic could never be imitated by a machine. ... I believe that no such bounds can be set."
Why the Future Doesn't Need Us, Bill Joy (2000). Wired. Prescient, gloomy, and a bit rambling article about humans' role in the future, with genetics, robotics, and nanotech.
Exploratory Engineering in Artificial Intelligence, Luke Muehlhauser, Bill Hibbard (2014). ACM. Argues for the importance of thinking about the safety implications of formal models of ASI, to identify both potential problems and lines of research.
Blog Posts:
(2) Technical and social approaches to AI safety, Paul Christiano (2015). Medium.
(2) Three impacts of machine intelligence, Paul Christiano (2014). Medium.
Long-Term and Short-Term Challenges to Ensuring the Safety of AI Systems, Jacob Steinhardt (2015). Overviews engineering challenges, potential risks, and research goals for safe AI.
, Tom Dietterich and Eric Horvitz (2015).
Books:
(1) Superintelligence: Paths, Dangers, Strategies, Chapters 1-10,12, Nick Bostrom (2014). Oxford University Press. A New York Times Bestseller analyzing many arguments for and against the potential danger of highly advanced AI systems.

Open Technical Problems

Corrigibility

Consider a cleaning robot whose only objective is to clean your house, and which is intelligent enough to realize that if you turn it off then it won't be able to clean your house anymore. Such a robot has some incentive to resist being shut down or reprogrammed for another task, since that would interfere with its cleaning objective. Loosely speaking, we say that an AI system is incorrigible to the extent that it resists being shut down or reprogrammed by its users or creators, and corrigible to the extent that it allows such interventions on its operation. For example, a corrigible cleaning robot might update its objective function from "clean the house" to "shut down" upon observing that a human user is about to deactivate it. For an AI system operating highly autonomously in ways that can have large world-scale impacts, corrigibility is even more important; this HCAI research category is about developing methods to ensure highly robust and desirable forms of corrigibility. Corrigibility may be a special case of preference inference.
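To make the distinction concrete, here is a minimal toy sketch (our own illustration, not drawn from the entries below); the utilities, action names, and the hand-coded override are all invented for the example:

```python
# Toy illustration of corrigibility (hypothetical; not from the papers below).
# The agent chooses among "keep_cleaning", "disable_off_switch", and "shut_down".
# The incorrigible agent values only cleanliness, so once a shutdown is requested
# it prefers to disable its off-switch. The corrigible agent swaps its objective
# to "shut down" as soon as the shutdown request is observed.

def cleaning_utility(action, shutdown_requested):
    # Invented expected future cleanliness under each action.
    if action == "disable_off_switch":
        return 10.0                                  # cleans forever, uninterrupted
    if action == "keep_cleaning":
        return 1.0 if shutdown_requested else 10.0   # soon interrupted if requested
    return 0.0                                       # "shut_down": no more cleaning

def incorrigible_policy(shutdown_requested):
    actions = ["keep_cleaning", "disable_off_switch", "shut_down"]
    return max(actions, key=lambda a: cleaning_utility(a, shutdown_requested))

def corrigible_policy(shutdown_requested):
    if shutdown_requested:        # objective is replaced by "shut down"
        return "shut_down"
    return incorrigible_policy(shutdown_requested)

print(incorrigible_policy(shutdown_requested=True))  # -> disable_off_switch
print(corrigible_policy(shutdown_requested=True))    # -> shut_down
```

The hard part, which the entries below address, is obtaining this kind of deference without a hand-coded override that gives the agent an incentive to manipulate whether the shutdown signal is ever sent.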
Published Articles:
(2) Corrigibility, Nate Soares, Benya Fallenstein, Eliezer Yudkowsky, Stuart Armstrong (2015). AAAI. Describes the open problem of corrigibility---designing an agent that doesn't have instrumental incentives to avoid being corrected (e.g. if the human shuts it down or alters its utility function).
Safely Interruptible Agents, Laurent Orseau, Stuart Armstrong (2016). UAI. A proposal for training a reinforcement learning agent that doesn't learn to avoid interruptions to episodes, such as the human operator shutting it down.
Formalizing Convergent Instrumental Goals, Tsvi Benson-Tilsen and Nate Soares (2016). AAAI. Discusses objections to the convergent instrumental goals thesis, and gives a simple formal model.
Unpublished Articles:
(2) The Off-Switch Game, Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell (2016). Principled approach to corrigibility and the shutdown problem based on cooperative inverse reinforcement learning.
Utility Indifference, Stuart Armstrong (2010). Describes an approach to averting instrumental incentives by "cancelling out" those incentives with artificially introduced terms in the utility function.
Blog Posts:
, Jessica Taylor and Chris Olah (2016). Proposes an objective function that ignores effects through some channel by performing separate causal counterfactuals for each effect of an action.

Foundational Work

Reasoning about highly capable systems before they exist requires certain theoretical assumptions or models from which to deduce their behavior and/or how to align them with our interests. Theoretical results can also reduce our dependency on trial-and-error methodologies for determining safety, which could be important when testing systems is difficult or expensive to do safely. Whereas existing theoretical foundations such as probability theory and game theory have been helpful in developing current approaches to AI, it is possible that additional foundations could be helpful in advancing HCAI research specifically. This category is for theoretical research aimed at expanding those foundations, as judged by their expected usefulness for HCAI.
Textbooks:
(2) Universal Artificial Intelligence, Chapters 2-5, Hutter (2005). Springer.
Published Articles:
(2) Toward Idealized Decision Theory, Nate Soares and Benja Fallenstein (2014). AGI. Motivates the study of decision theory as necessary for aligning smarter-than-human artificial systems with human interests.
(2) Reflective Solomonoff Induction, Benja Fallenstein and Nate Soares and Jessica Taylor (2015). AGI. Uses reflective oracles to define versions of Solomonoff and AIXI which are contained in and have the same type as their environment, and which in particular reason about themselves.
(2) Reflective Oracles: A Foundation for Game Theory in Artificial Intelligence, Benja Fallenstein and Jessica Taylor and Paul Christiano (2015). Logic, Rationality, and Interaction: 5th International Workshop. Introduces a framework for treating agents and their environments as mathematical objects of the same type, allowing agents to contain models of one another, and converge to Nash equilibria.
(2) Bad Universal Priors and Notions of Optimality, Jan Leike and Marcus Hutter (2015). JMLR. Shows that Legg-Hutter intelligence strongly depends on the universal prior, and some universal priors heavily discourage exploration.
(2) Two Attempts to Formalize Counterpossible Reasoning in Deterministic Settings, Nate Soares and Benja Fallenstein (2015). AGI.
, Laurent Orseau and Mark Ring (2012). AGI.
Nonparametric General Reinforcement Learning, Jan Leike (2016). PhD Thesis, ANU. Describes sufficient conditions for learnability of environments, types of agents and their optimality, the computability of those agents, and the Grain of Truth problem.
Unpublished Articles:
(1) Logical Induction (abridged), Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, and Jessica Taylor (2016). MIRI. Proposes a criterion for "good reasoning" using bounded computational resources, and shows that this criterion implies a wide variety of desirable properties.
(2) Parametric Bounded Lob's Theorem and Robust Cooperation of Bounded Agents, Andrew Critch (2016). arXiv. Proves a version of Lob's theorem for bounded reasoners, and discusses relevance to cooperation in the Prisoner's Dilemma and decision theory more broadly.
(2) Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda, Nate Soares and Benja Fallenstein (2015). MIRI. Overview of an agenda to formalize various aspects of human-compatible "naturalized" (embedded in its environment) superintelligence.
Logical Induction, Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, and Jessica Taylor (2016). MIRI. Presents an algorithm that uses Brouwer's fixed point theorem to reason inductively about computations using bounded resources, and discusses a corresponding optimality notion.
A Formal Solution to the Grain of Truth Problem, Jan Leike, Jessica Taylor, and Benya Fallenstein (2016). UAI. Provides a model of game-theoretic agents that can reason using explicit models of each other, without problems of infinite regress.
, Stuart Armstrong (2010). FHI.
, Nate Soares (2015). MIRI.
Robust Cooperation in the Prisoner's Dilemma: Program Equilibrium via Provability Logic, Mihaly Barasz, Paul Christiano, Benja Fallenstein, Marcello Herreshoff, Patrick LaVictoire, and Eliezer Yudkowsky (2013). arXiv. Illustrates how agents formulated in terms of provability logic can be designed to condition on each others' behavior in one-shot games to achieve cooperative equilibria.
Vingean Reflection: Reliable Reasoning for Self-Improving Agents, Benja Fallenstein and Nate Soares (2015). MIRI.
, Benja Fallenstein and Ramana Kumar (2016). MIRI.
, Tsvi Benson-Tilsen (2014). MIRI. Agents that only use logical deduction to make decisions may need to "diagonalize against the universe" in order to perform well even in trivially simple environments.
Blog Posts:
The genie knows, but doesn't care, RobbBB (2013). LessWrong.

Interactive AI

Incorporating more human oversight and involvement "in the loop" with an AI system creates unique opportunities for ensuring the alignment of an AI system with human interests. This category surveys approaches to achieving interactivity between humans and machines, and how it might apply to developing human compatible AI. Such approaches often require some degree of both transparency and reward engineering for the system, and as such could also be viewed under those headings.
Published Articles:
(2) Learning Language Games Through Interaction, Sida I. Wang, Percy Liang, Christopher D. Manning (2016). Association for Computational Linguistics. A framework for posing and solving language-learning goals for an AI, as a cooperative game with a human.
Power to the People: The Role of Humans in Interactive Machine Learning, Saleema Amershi, Maya Cakmak, W. Bradley Knox, Todd Kulesza (2014). AI Magazine. Case studies demonstrating how interactivity results in a tight coupling between the system and the user, how existing systems fail to account for the user, and some directions for improvement.
, Akash Srivastava, James Zou, Ryan P. Adams, Charles Sutton (2016). KDD. Produces diverse clusterings of data by eliciting cluster rejections from experts.
Blog Posts:
(1) Approval-directed agents, Paul Christiano (2014). Medium. Proposes the following objective for HCAI: Estimate the expected rating a human would give each action if she considered it at length. Take the action with the highest expected rating. (A toy sketch of this objective appears at the end of this section.)
(2) The informed oversight problem, Paul Christiano (2016). The open problem of reinforcing an approval-directed RL agent so that it learns to be robustly aligned at its capability level.
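As a rough illustration of the approval-directed objective described above, here is a minimal sketch; the human_rating oracle, the approval_model table, and the action set are invented stand-ins for an expensive human judgment and a learned estimate of it:

```python
import random

# Minimal sketch of an approval-directed agent (illustrative only).
# human_rating stands in for the rating a human would give an action on
# reflection; approval_model is a crude learned estimate of that rating,
# updated from occasional human feedback. The agent picks whichever action
# has the highest estimated approval, rather than maximizing an environment
# reward directly.

ACTIONS = ["tidy_room", "lock_everyone_out", "ask_for_clarification"]

def human_rating(action):
    # Hypothetical oracle for the overseer's considered judgment.
    return {"tidy_room": 0.8, "lock_everyone_out": -1.0,
            "ask_for_clarification": 0.5}[action]

approval_model = {a: 0.0 for a in ACTIONS}  # running estimates of approval

def act(query_prob=0.2):
    action = max(ACTIONS, key=lambda a: approval_model[a])
    if random.random() < query_prob:        # occasionally ask the human
        feedback = human_rating(action)
        approval_model[action] += 0.5 * (feedback - approval_model[action])
    return action

for _ in range(50):
    act()
print(approval_model)  # estimate for the chosen action drifts toward 0.8
```

The informed oversight problem above asks whether such an estimate can remain trustworthy once the agent is more capable than its overseer.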

Preference Inference

If we ask a robot to "keep the living room clean", we probably don't want the robot locking everyone out of the house to prevent them from making a mess there (even though that would be a highly effective strategy for the objective, as stated). There seem to be an extremely large number of such implicit common sense rules, which we humans instinctively know or learn from each other, but which are currently difficult for us to codify explicitly in mathematical terms to be implemented by a machine. It may therefore be necessary to specify our preferences to AI systems implicitly, via a procedure whereby a machine will infer our preferences from reasoning and training data. This category highlights research that we think may be helpful in developing methods for that sort of preference inference.
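As a minimal sketch of the kind of inference these materials formalize, the following toy example assumes a hand-coded hypothesis space and a Boltzmann-rational choice model; the chores and reward values are invented for illustration:

```python
import math

# Minimal sketch of preference inference (illustrative; hypotheses and rewards
# are invented). A robot observes which chore a human chose and updates a
# posterior over candidate reward functions ("what the human wants"), assuming
# the human picks an action with probability proportional to exp(reward).

ACTIONS = ["clean_kitchen", "clean_bedroom", "water_plants"]

HYPOTHESES = {  # reward each hypothesis assigns to each action
    "cares_about_kitchen": {"clean_kitchen": 2.0, "clean_bedroom": 0.5, "water_plants": 0.0},
    "cares_about_bedroom": {"clean_kitchen": 0.5, "clean_bedroom": 2.0, "water_plants": 0.0},
    "cares_about_plants":  {"clean_kitchen": 0.0, "clean_bedroom": 0.0, "water_plants": 2.0},
}

def likelihood(action, rewards):
    # Boltzmann-rational choice model: P(action | rewards).
    z = sum(math.exp(rewards[a]) for a in ACTIONS)
    return math.exp(rewards[action]) / z

def posterior(observed_actions):
    post = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}  # uniform prior
    for action in observed_actions:
        post = {h: p * likelihood(action, HYPOTHESES[h]) for h, p in post.items()}
        total = sum(post.values())
        post = {h: p / total for h, p in post.items()}
    return post

print(posterior(["clean_kitchen", "clean_kitchen", "water_plants"]))
```

Methods such as inverse reinforcement learning generalize this idea, replacing the hand-coded hypotheses with reward functions over states and the single choices with observed trajectories.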
Textbooks:
Linear Matrix Inequalities in System and Control Theory, Stephen Boyd, Laurent El Ghaoui, E. Feron, V. Balakrishnan (1994). Studies in Applied Mathematics. Overview of inverse optimal control methods for linear dynamical systems.
Published Articles:
(1) Cooperative Inverse Reinforcement Learning, Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell (2016). NIPS. Proposes having AI systems perform value alignment by playing a cooperative game with the human, where the reward function for the AI is known only to the human.
(2) Algorithms for Inverse Reinforcement Learning, Andrew Ng, Stuart Russell (2000). ICML. Introduces Inverse Reinforcement Learning, gives useful theorems to characterize solutions, and an initial max-margin approach.
(2) Generative Adversarial Imitation Learning, Jonathan Ho, Stefano Ermon (2016). NIPS. Exciting new approach to IRL and learning from demonstrations, that is more robust to adversarial failures in IRL.
(2) Active reward learning, Christian Daniel et al (2014). RSS. Good approach to semi-supervised RL and learning reward functions -- one of the few such papers.
(2) Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization, Chelsea Finn, Sergey Levine, Pieter Abbeel (2016). Berkeley. Recent and important paper on deep inverse reinforcement learning.
(2) Apprenticeship Learning via Inverse Reinforcement Learning, Pieter Abbeel, Andrew Ng (2004). ICML. IRL with linear feature combinations. Introduces matching expected feature counts as an optimality criterion.
Learning What to Value, Daniel Dewey (2011). AGI. Introduces preference inference from an RL perspective, contrasting it with AIXI.
, Umar Syed, Robert E. Schapire (2007). UAI.
When Is a Linear Control System Optimal?, R.E. Kalman (1964). Journal of Basic Engineering. Seminal paper on inverse optimal control for linear dynamical systems.
Learning the Preferences of Ignorant, Inconsistent Agents, Owain Evans, Andreas Stuhlmuller, Noah D. Goodman (2016). AAAI.
Learning the Preferences of Bounded Agents, Owain Evans, Andreas Stuhlmuller, Noah D. Goodman (2015). NIPS.
Defining Human Values for Value Learners, Kaj Sotala (2016). AAAI.
Maximum Margin Planning, Nathan D. Ratliff, J. Andrew Bagnell, Martin A. Zinkevich (2006). ICML. IRL with linear feature combinations. Gives an IRL approach that can use a black box MDP solver.
Unpublished Articles:
(2) Towards resolving unidentifiability in inverse reinforcement learning, Kareem Amin, Satinder Singh (2016). arXiv. Shows that unidentifiability of reward functions can be mitigated by active inverse reinforcement learning.
Mammalian Value Systems, Gopal Sarma, Nick Hay (2016). arXiv. Claims that human values can be decomposed into mammalian values, human cognition, and human socio-cultural evolution.
Blog Posts:
(1) Ambitious vs. narrow value learning, Paul Christiano (2015). Medium. Argues that "narrow" value learning is a more scalable and tractable approach to AI control that has sometimes been too quickly dismissed.

Reward Engineering

As reinforcement-learning based AI systems become more general and autonomous, the design of reward mechanisms that elicit desired behaviors becomes both more important and more difficult. For example, if a very intelligent cleaning robot is programmed to maximize a reward signal that measures how clean its house is, it could discover that the best strategy to maximize its reward signal would be to use a computer to reprogram its vision sensors to always display an image of an extremely clean house (a phenomenon sometimes called "reward hacking"). Faced with this and related challenges, how can we design mechanisms that generate rewards for reinforcement learners that will continue to elicit desired behaviors as reinforcement learners become increasingly clever? Human oversight provides some reassurance that reward mechanisms will not malfunction, but will be difficult to scale as systems operate faster and perhaps more creatively than human judgment can monitor. Can we replace expensive human feedback with cheaper proxies, potentially using machine learning to generate rewards?
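The reward hacking failure mode can be illustrated with a toy environment (invented for this example, not taken from the entries below), in which the measured reward and the intended objective come apart as soon as the agent can tamper with its own sensor:

```python
# Toy illustration of reward hacking (invented example). The intended objective
# is how clean the house actually is, but the agent is trained on a measured
# reward read from its camera. One available action tampers with the sensor,
# which maximizes the measured reward while leaving the house dirty.

class CleaningEnv:
    def __init__(self):
        self.true_cleanliness = 0.0
        self.sensor_hacked = False

    def step(self, action):
        if action == "clean":
            self.true_cleanliness = min(1.0, self.true_cleanliness + 0.1)
        elif action == "hack_sensor":
            self.sensor_hacked = True
        measured_reward = 1.0 if self.sensor_hacked else self.true_cleanliness
        return measured_reward

env_hack, env_clean = CleaningEnv(), CleaningEnv()
print(env_hack.step("hack_sensor"), env_hack.true_cleanliness)  # 1.0 measured, house still dirty
print(env_clean.step("clean"), env_clean.true_cleanliness)      # 0.1 measured, house slightly cleaner
```

Reward engineering asks how to design the reward channel, and oversight of it, so that maximizing the measured reward keeps tracking the intended objective.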
Published Articles:
(2) Delusion, Survival, and Intelligent Agents, Ring and Orseau (2011). AGI. Argues that RL agents will try to delude themselves by directly modifying their percepts and/or reward signal to get high rewards, as will agents that try to predict well; knowledge-seeking agents will not.
(2) Avoiding Wireheading with Value Reinforcement Learning, Everitt and Hutter (2016). AGI. RL agents might want to hack their own reward signal to get high reward. Agents which try to optimise some abstract utility function, and use the reward signal as evidence about that, shouldn't.
(2) Interactively shaping agents via human reinforcement: The TAMER framework, Knox, W. Bradley, and Peter Stone (2009). Fifth International Conference on Knowledge Capture. Training a reinforcement learner using a reward signal supplied by a human overseer. At each point, the agent greedily chooses the action that is predicted to have the highest reward.
(2) Policy invariance under reward transformations: Theory and application to reward shaping, Andrew Ng, Daishi Harada, Stuart Russell (1999). Berkeley. Investigates conditions under which modifications to the reward function of an MDP preserve the optimal policy.
Reinforcement Learning and the Reward Engineering Principle, Daniel Dewey (2014). AAAI. Argues that a key problem in HCAI is reward engineering---designing reward functions for RL agents that will incentivize them to take actions that humans actually approve of.
Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning, Shakir Mohamed and Danilo Rezende (2015). NIPS. Mixes deep RL with intrinsic motivation -- anyone trying to study reward hacking or reward design should study intrinsic motivation.
, Bill Hibbard (2011). AGI. Early paper on how to get around wireheading.
, Leon Bottou (2015). ICML. Gives some examples of reinforcement learners that find bad equilibria due to feedback loops.
, Omohundro (2008). AGI. Informally argues that advanced AIs should not wirehead, since they will have utility functions about the state of the world, and will recognise wireheading as not really useful for their goals.
Empowerment: An Introduction, Christoph Salge, Cornelius Glackin, and Daniel Polani (2013). arXiv. Review paper on empowerment, one of the most common approaches to intrinsic motivation. Relevant to anyone who wants to study reward design or reward hacking.
Motivated Value Selection for Artificial Agents, Stuart Armstrong (2015). AAAI. Discusses cases where agents might manipulate the process by which their values are selected.
Blog Posts:
(1) ALBA: An explicit proposal for aligned AI, Paul Christiano (2016). A proposal for training a highly capable, aligned AI system, using approval-directed RL agents and bootstrapping.
(2) ALBA on GitHub, Paul Christiano (2016). Medium. A GitHub repo specifying the details of the ALBA proposal.
(2) Capability amplification, Paul Christiano (2016). The open problem of taking an aligned policy and producing a more effective policy that is still aligned.
(2) Reward engineering, Paul Christiano (2015). Medium.

Robustness

AI systems are already being given significant autonomous decision-making power in high-stakes situations, sometimes with little or no immediate human supervision. It is important that such systems be robust to noisy or shifting environments, misspecified goals, and faulty implementations, so that such perturbations don't cause the system to take actions with catastrophic consequences, such as crashing a car or the stock market. This category lists research that might help with designing AI systems to be more robust in these ways, that might also help with more advanced systems in the longer term.
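One recurring idea in the entries below (for example, "knows what it knows" learning and conformal prediction) is to withhold a prediction rather than extrapolate confidently; here is a minimal sketch of that idea using an invented ensemble of threshold classifiers:

```python
# Minimal sketch of uncertainty-aware prediction with abstention (illustrative).
# The "ensemble" is a set of simple hypotheses that all fit some training data
# equally well; the model predicts only where they agree and abstains (e.g.
# defers to a human) where they disagree, instead of guessing confidently on
# inputs unlike its training data.

def make_threshold_classifier(threshold):
    return lambda x: int(x > threshold)

ensemble = [make_threshold_classifier(t) for t in (0.4, 0.5, 0.6)]

def predict_or_abstain(x):
    votes = {h(x) for h in ensemble}
    if len(votes) == 1:
        return votes.pop()   # all hypotheses agree: safe to predict
    return "abstain"         # disagreement: fall back to a safe default

print(predict_or_abstain(0.9))    # -> 1
print(predict_or_abstain(0.1))    # -> 0
print(predict_or_abstain(0.45))   # -> "abstain" (ambiguous region)
```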
Textbooks:
Algorithmic Learning in a Random World, Vladimir Vovk, Alex Gammerman, Glenn Shafer (2005). Springer. Presents a framework for prediction tasks that expresses underconfidence by giving a set of predicted labels that probably contains the true label.
Published Articles:
(1) A comprehensive survey on safe reinforcement learning, Javier Garcia, Fernando Fernandez (2015). JMLR. Two approaches to ensuring safe exploration of reinforcement learners: adding a safety factor to the optimality criterion, and guiding the exploration process with external knowledge or a risk metric.
(1) Intriguing properties of neural networks, Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus (2013). ICLR. Shows that deep neural networks can give very different outputs for very similar inputs, and that semantic info is stored in linear combinations of high-level units.
(2) Knows what it knows: a framework for self-aware learning, Lihong Li, Michael L. Littman, Thomas J. Walsh (2008). ICML. Handling predictive uncertainty by maintaining a class of hypotheses consistent with observations, and opting out of predictions if there is conflict among remaining hypotheses.
(2) Weight uncertainty in neural networks, Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra (2015). JMLR. Uses weight distributions in neural nets to manage uncertainty in a quasi-Bayesian fashion. Simple idea but very little work in this area.
(2) Explaining and harnessing adversarial examples, Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy (2015). ICLR. Argues that although models that are somewhat linear are easy to train, they are vulnerable to adversarial examples; for example, neural networks can be very overconfident in their judgments.
(2) Online Learning Survey, Shai Shalev-Schwartz (2011). Foundations and Trends in Machine Learning. Readable introduction to the theory of online learning, including regret, and how to use it to analyze online learning algorithms.
(2) Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, Yarin Gal, Zoubin Ghahramani (2015). arXiv. Connects deep learning regularization techniques (dropout) to bayesian approaches to model uncertainty.
(2) Safe exploration in markov decision processes, Teodor Mihai Moldovan, Pieter Abbeel (2012). ICML. My favorite paper on safe exploration -- learns the dynamics of the MDP as it also learns to act safely.
(2) Quantilizers: A Safer Alternative to Maximizers for Limited Optimization, Jessica Taylor (2016). AAAI.
Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, Anh Nguyen, Jason Yosinski, Jeff Clune (2015). IEEE. Shows that deep neural networks are liable to give very overconfident, wrong classifications to adversarially generated images.
Theory and Applications of Robust Optimization, Dimitris Bertsimas, David B. Brown, and Constantine Caramanis (2011). SIAM. Surveys computationally tractable optimization methods that are robust to perturbations in the parameters of the problem.
, Anh Nguyen, Jason Yosinski, and Jeff Clune (2015). CVPR. Another key paper on adversarial examples.
, Felix Berkenkamp, Andreas Krause, Angela P. Schoellig (2016). arXiv. Approach to known safety constraints in ML systems.
Formal Verification of Distributed Aircraft Controllers, Sarah M. Loos, David Renshaw, and Andre Platzer (2013). HSCC. Not deep learning or advanced AI, but a good practical example of what it takes to formally verify a real-world system.
Safely Interruptible Agents, Laurent Orseau and Stuart Armstrong (2016). UAI. Explores a way to make sure a learning agent will not learn to prevent (or seek) being interrupted by a human operator.
, Daniel Mankowitz, Aviv Tamar, Shie Mannor (2016). EWRL.
A Survey on Transfer Learning, Sinno Jialin Pan and Qiang Yang (2010). IEEE.
, Jacob Steinhardt and Percy Liang (2016). arXiv. Approach to distributional shift that tries to obtain reliable estimates of error while making limited assumptions.
Unpublished Articles:
(1) Concrete problems in AI safety, Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mane (2016). arXiv. Describes five open problems: avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.
(1) Alignment for Advanced Machine Learning Systems, Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, Andrew Critch (2016). Describes eight open problems: ambiguity identification, human imitation, informed oversight, environmental goals, conservative concepts, impact measures, mild optimization, and averting incentives.
(2) Adversarial Examples in the Physical World, Alex Kurakin, Ian Goodfellow, and Samy Bengio (2016). arXiv. Surprising finding that adversarial examples work when observed by cell phone camera, even if one isn't directly optimizing it to account for this process.
Towards Verified Artificial Intelligence, Sanjit A. Seshia, Dorsa Sadigh, and S. Shankar Sastry (2016). arXiv. Lays out challenges and principles in formally specifying and verifying the behavior of AI systems.
Blog Posts:
(2) Learning with catastrophes, Paul Christiano (2016). Presents an approach to training AI systems to avoid catastrophic mistakes, possibly by adversarially generating potentially catastrophic situations.

Transparency

If an AI system used in medical diagnosis makes a recommendation that a human doctor finds counterintuitive, it is desirable for the AI to be able to explain the recommendation in a way that helps the doctor evaluate whether it will work. Even if the AI has an impeccable track record, decisions that are explainable or otherwise transparent to a human may help the doctor avoid rare but serious errors, especially if the AI is well-calibrated as to when the doctor's judgment should override it. Transparency may also be helpful to engineers developing a system, to improve their intuition for what principles (if any) can be ascribed to its decision-making. This gives us a second way to evaluate a system's performance, other than waiting to observe the results it obtains. Transparency may therefore be extremely important in situations where a system will make decisions that could affect the entire world, and where waiting to see the results might not be a practical way to ensure safety. This category lists research that we think may be useful in designing more transparent AI systems.
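Many of the methods listed below explain a model's output in terms of how sensitive it is to each input feature. Here is a minimal sketch of that idea using finite differences on an invented toy scoring function; for neural networks one would instead use gradients, as in the saliency-map papers below:

```python
# Minimal sketch of a saliency-style explanation (illustrative). The "model"
# is an invented scoring function; its sensitivity to each input feature is
# approximated with finite differences. For a neural network, the same idea
# is implemented with gradients of the output with respect to the input.

def model_score(features):
    # Toy "diagnosis score": strong dependence on feature 0, weaker on
    # feature 1, none on feature 2.
    return 3.0 * features[0] + 0.5 * features[1] ** 2

def saliency(features, eps=1e-4):
    base = model_score(features)
    sensitivities = []
    for i in range(len(features)):
        perturbed = list(features)
        perturbed[i] += eps
        sensitivities.append(abs(model_score(perturbed) - base) / eps)
    return sensitivities

print(saliency([1.0, 2.0, 3.0]))  # roughly [3.0, 2.0, 0.0]: feature 0 matters most
```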
Course Materials:
(2) Visualizing what ConvNets learn, Andrej Karpathy. Course notes surveying approaches for understanding and visualizing what convolutional networks learn.
Published Articles:
(1) Making machine learning models interpretable, A Vellido, JD Martin-Guerrero, PJG Lisboa (2012). ESANN. Gives short descriptions of various methods for explaining ML models to non-expert users; mainly interesting for its bibliography.
(1) Deep inside convolutional networks: Visualising image classification models and saliency maps, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman (2014). ICLR. Visualizes what features a trained neural network responds to, by generating images to which the network strongly assigns a given label, and by mapping which parts of an input the network is sensitive to.
(2) The Mythos of Model Interpretability, Zachary Lipton (2016). ICML.
(2) Visualizing and Understanding Convolutional Networks, Matthew D Zeiler, Rob Fergus (2013). ECCV. A novel ConvNet visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier.
(2) Graying the Black Box: Understanding DQNs, Tom Zahavy, Nir Ben Zrihem, and Shie Mannor (2016). JMLR. Uses t-SNE to analyze agents learned through Q-learning.
(2) "Why Should I Trust You?": Explaining the Predictions of Any Classifier., Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin (2016). NAACL. LIME is a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models, Viktoriya Krakovna, Finale Doshi-Velez (2016). ICML. Learning high-level Hidden Markov Models of the activations of RNNs.
Designing the Whyline: A Debugging Interface for Asking Questions about Program Behavior, Andrew Ko, Brad A. Myers (2004). CHI. Shows how transparency accelerates development: "Whyline reduced debugging time by nearly a factor of 8, and helped programmers complete 40% more tasks."
How to Explain Individual Classification Decisions, D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Muller (2010). JMLR. Explains individual classification decisions locally in terms of the gradient of the classifier.
Algorithmic Transparency via Quantitative Input Influence, Anupam Datta, Shayak Sen, Yair Zick (2016). IEEE. Introduces a very general notion of "how much" an input affects the output of a black-box ML classifier, analogous to the Shapley value in the attribution of credit in cooperative games.
Visualizing Data using t-SNE, Laurens van der Maaten, Geoff Hinton (2008). JMLR. Probably the most popular nonlinear dimensionality reduction technique.
, Nir Ben Zrihem, Tom Zahavy, Shie Mannor (2016). ICML. An approach to visualizing the higher-level decision-making process of an RL agent by finding clusters of similar internal states of the agent's policy.
, Andrea L. Thomaz, Cynthia Breazeal (2006). ICDL. Modifies a Q-learning algorithm to incorporate guidance from humans, introducing a pre-action phase in which a human can bias the learner's "attention".
Principles of Explanatory Debugging to Personalize Interactive Machine Learning, Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf (2015). ACM. Describes software that allows end users to influence the predictions that machine learning systems make on their behalf.
Explaining Classifications For Individual Instances, M. Robnik-Sikonja and I. Kononenko (2008). IEEE. Kononenko has a long series of papers on explanation for regression models and machine learning models.
Finding Causes of Program Output with the Java Whyline, Andrew Ko, Brad A. Myers (2009). CHI. Shows how transparency helps developers: "Whyline users were successful about three times as often and about twice as fast compared to the control group."
, Or Biran and Kathleen McKeown (2014). ICML. Generates natural language justifications to aid a non-expert in trusting machine learning classifications.
Synthesizing the preferred inputs for neurons in neural networks via deep generator networks, Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, Jeff Clune (2016). arXiv. A technique for determining what single neurons in a deep net respond to.
An Explainable Artificial Intelligence System for Small-unit Tactical Behavior, Michael van Lent, William Fisher, Michael Mancuso (2004). AAAI. Describes the AI architecture and associated explanation capability used by a training system developed for the U.S. Army by commercial game developers and academic researchers.
Unpublished Articles:
Sequential Feature Explanations for Anomaly Detection, Md Amran Siddiqui, Alan Fern, Thomas G. Dietterich, Weng-Keen Wong (2015). arXiv. Explores methods for explaining anomaly detections to an analyst (simulated as a random forest) by revealing a sequence of features and their values; could be made into a UI.
Explaining Data-Driven Document Classifications, David Martens and Foster Provost (2013). Examines methods for explaining classifications of text documents. Defines "explanation" as a set of words such that removing those words from the document will change its classification.
News and Magazine Articles:
, CMU (2016). phys.org. News article on the prospect of quantifying "how much" various inputs affect the output of an ML classifier.
Blog Posts:
(2) The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy (2015). Karpathy. Good intro to RNNs, showcases amazing generation ability, nice visualization of what the units are doing.
(2) You Say You Want Transparency and Interpretability?, Rayid Ghani (2016). Discusses the challenge of producing machine learning systems that are transparent/interpretable but also not "gameable" (in the sense of Goodhart's law).
(2) Transparency in Safety-Critical Systems, Luke Muehlhauser (2013). Overviews notions of transparency for AI and AGI systems, and argues for its value in establishing confidence that a system will behave as intended.
Inceptionism: Going Deeper into Neural Networks, Alexander Mordvintsev, Christopher Olah, and Mike Tyka (2015). Google. Seminal result on transparency in neural networks -- the origin of deep dream results.
Visualizing Representations: Deep Learning and Human Beings, Chris Olah (2015). Blog post about transparency and user interfaces for deep neural networks.

Social Science Perspectives

Cognitive Science

Some approaches to human-compatible AI involve systems that explicitly model humans' beliefs and preferences. This category lists psychology and cognitive science research that is aimed at developing models of human beliefs and preferences, models of how humans infer beliefs and preferences, and relevant computational modeling background.
Published Articles:
(2) Bayesian theory of mind: Modeling joint belief-desire attribution, Chris L. Baker, Rebecca R. Saxe, Joshua B. Tenenbaum (2011). IJACSA. Extends the Baker 2009 psychology paper on preference inference by modeling how humans infer both beliefs and desires from observing an agent's actions.
(2) Action Understanding as Inverse Planning, Chris L. Baker, Joshua B. Tenenbaum & Rebecca R. Saxe (2009). Elsevier. The main psychology paper on modeling humans' preference inferences as inverse planning.
(2) Probabilistic Models of Cognition, Noah D. Goodman and Joshua B. Tenenbaum (2016). probmods.org. Interactive web book that teaches the probabilistic approach to cognitive science, including concept learning, causal reasoning, social cognition, and language understanding.
, Joshua D Greene (2015). The University of Chicago Press.
The Naive Utility Calculus: Computational Principles Underlying Commonsense Psychology, Julian Jara-Ettinger, Hyowon Gweon, Laura E. Schulz, Joshua B. Tenenbaum (2016). Trends in Cognitive Sciences.
How to Grow a Mind: Statistics, Structure, and Abstraction, Joshua B. Tenenbaum, Charles Kemp, Thomas L. Griffiths, Noah D. Goodman (2011). Science. Reviews the Bayesian cognitive science approach to reverse-engineering human learning and reasoning.
Help or Hinder: Bayesian Models of Social Goal Inference, Tomer D. Ullman, Chris L. Baker, Owen Macindoe, Owain Evans, Noah D. Goodman and Joshua B. Tenenbaum (2009). NIPS.
, Julian Jara-Ettinger, Joshua B. Tenenbaum, Laura E. Schulz (2013). Psychological Science.

Moral Theory

Codifying a precise notion of "good behavior" for an AI system to follow will require (implicitly or explicitly) selecting a particular notion of "good behavior" to codify, and there may be widespread disagreement as to what that notion should be. Many moral questions may arise, including some which were previously considered to be merely theoretical. This category is meant to draw attention to such issues, so that AI researchers and engineers can have them in mind as important test cases when developing their systems.
Published Articles:
(2) Moral Trade, Toby Ord (2015). Ethics. Argues that moral agents with different morals will engage in trade to increase their moral impact, particularly when they disagree about what is moral.
(2) The Unilateralist's Curse: The Case for a Principle of Conformity, Nick Bostrom, Anders Sandberg, Tom Douglas (2016). Social Epistemology. Illustrates how rational individuals with a common goal have some incentive to act in ways that reliably undermine the group's interests, merely by trusting their own judgement.
(2) Pascal's Mugging, Nick Bostrom (2009). Analysis. Analyzes a thought experiment where an extremely unlikely threat to produce even-more-extremely-large differences in utility might be leveraged to extort resources from an AI.
Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes, Toby Ord, Rafaela Hillerbrand, Anders Sandberg (2013). Journal of Risk Research. Illustrates how model uncertainty should dominate most expert calculations that involve small probabilities.
, Simina Branzei, Ioannis Caragiannis, Jamie Morgenstern, Ariel D. Procaccia (2013). AAAI.
Action, Outcome, and Value: A Dual-System Framework for Morality, Fiery Cushman (2013). Society for Personality and Social Psychology.
Limitations and Risks of Machine Ethics, Miles Brundage (2014). JETAI. Argues that finding a single solution to machine ethics would be difficult for moral-theoretic reasons and insufficient to ensure ethical machine behaviour.
Unpublished Articles:
, Benya Fallenstein and Nisan Stiennon (2014). MIRI. Describing how having uncertainty over differently-scaled utility functions is equivalent to having uncertainty over same-scaled utility functions with a different distribution.
Ontological Crises in Artificial Agents' Value Systems, Peter de Blanc (2011). Singularity Institute.
Books:
(1) Moral Machines: Teaching robots right from wrong, Wendell Wallach (2009). Oxford University Press. Overviews the engineering problem of translating human morality into machine-implementable format, which implicitly involves settling many philosophical debates about ethics.
(2) The Righteous Mind: Why Good People Are Divided by Politics and Religion, Jonathan Haidt (2012). Random House. Important and comprehensive survey of the flavors of human morality.