1962-bryson.pdf: “A Steepest-Ascent Method for Solving Optimum Programming Problems”, (1962-06-01; ):
A systematic and rapid steepest-ascent numerical procedure is described for solving two-point boundary-value problems in the calculus of variations for systems governed by a set of nonlinear ordinary differential equations. Numerical examples are presented for minimum time-to-climb and maximum altitude paths for a supersonic interceptor and maximum-range paths for an orbital glider. [Keywords: boundary-value problems, computer programming, differential equations, variational techniques]
…A systematic and rapid steepest-ascent numerical procedure is described for determining optimum programs for nonlinear systems with terminal constraints. The procedure uses the concept of local linearization around a nominal (non-optimum) path. The effect on the terminal conditions of a small change in the control variable program is determined by numerical integration of the adjoint differential equations for small perturbations about the nominal path. Having these adjoint (or influence) functions, it is then possible to determine the change in the control variable program that gives maximum increase in the pay-off function for a given mean-square perturbation of the control variable program while simultaneously changing the terminal quantities by desired amounts. By repeating this process in small steps, a control variable program that minimizes one quantity and yields specified values of other terminal quantities can be approached as closely as desired. Three numerical examples are presented: (a) The angle-of-attack program for a typical supersonic interceptor to climb to altitude in minimum time is determined with and without specified terminal velocity and heading. (b) The angle-of-attack program for the same interceptor to climb to maximum altitude is determined. (c) The angle-of-attack program is determined for a hypersonic orbital glider to obtain maximum surface range starting from satellite speed at 300,000 ft altitude.
1962-harley.pdf: “Untitled” ( )
1962-kelley.pdf: “Method of Gradients”, (1962-01-01):
The method of gradients, also known as the method of steepest descent, is an elementary concept for the solution of minimum problems. In recent years the computational appeal of the method has led to its adoption in a variety of applications such as multivariable minimum problems of ordinary calculus, solution of systems of algebraic equations, integral equations, and variational problems. This chapter begins with a discussion of the main features of the gradient method in the context of ordinary minimum problems subject to constraints. It also discusses the variational problems of flight performance, introducing Green’s functions in the role played by partial derivatives in ordinary minimum problems and attempting to preserve an analogy between the two classes of problems in the subsequent development. The close relationship between Green’s functions or influence functions and the error coefficients of guidance theory has drawn attention to the usefulness of the adjoint system technique in guidance analysis.
1963-kelley.pdf: “Singular Extremals In Lawden's Problem Of Optimal Rocket Flight”, (1963-07-01):
The problem of optimal rocket flight in an inverse square law force field has been studied extensively by Lawden and Leitmann. Periods of zero thrust, intermediate thrust, and maximum thrust are possible subarcs of the solution according to analysis of the Euler-Lagrange equations and the Weierstrass necessary condition. Arcs of intermediate thrust have been examined recently by Lawden; however, the question of whether or not such arcs actually may furnish a minimum has been left unresolved. The present paper derives the singular extremals of Lawden’s problem by means of the Legendre-Clebsch necessary condition applied in a transformed system of state and control variables.
These are obtained as circular orbits along which the thrust is zero and intermediate thrust arcs are found in Lawden’s analysis. Since these solutions satisfy only the weak form of the Legendre-Clebsch condition, i.e., the extremals are singular in the transformed system of variables, the question of their minimality remains unanswered.
1964-kanal.pdf: “Untitled” ( )
1966-ivakhnenko.pdf: “Cybernetic Predicting Devices”, (1966-09-23):
[Predicting programs designed for large general-purpose computers constitute an important new tool in the control of production and economics. Nevertheless, small predicting filters have their own domain of application. They can be realized not only as programs for general-purpose computers, but also as simple analog devices with very fast response. The authors discuss three principal methods of prediction in addition to some others. Prediction of deterministic processes, i.e. extrapolation and interpolation. Prediction of stochastic processes, based on statistical prediction theory. Prediction based on adaptation or learning of the predicting filters.]
1977-agrawala-machinerecognitionofpatterns.pdf: “Untitled” ( )
1985-michie.pdf: “Human Window on the World”, Donald Michie ( )
1986-jordan.pdf: “Serial Order: A Parallel Distributed Processing Approach”, (1986-05; ):
A theory of serial order is proposed that attempts to deal both with the classical problem of the temporal organization of internally generated action sequences as well as with certain of the parallel aspects of sequential behavior.
The theory describes a dynamical system that is embodied as a parallel distributed processing or connectionist network. The trajectories of this dynamical system come to follow desired paths corresponding to particular action sequences as a result of a learning process during which constraints are imposed on the system. These constraints enforce sequentiality where necessary and, as they are relaxed, performance becomes more parallel.
The theory is applied to the problem of coarticulation in speech production and simulation experiments are presented.
1986-michie-onmachineintelligence.pdf: “On Machine Intelligence, Second Edition”, Donald Michie ( )
1986-rumelhart-2.pdf: “Learning representations by backpropagating errors”, (1986-10-09; ):
We describe a new learning procedure, backpropagation, for networks of neuron-like units.
The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units.
The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.
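The weight-adjustment procedure described in this abstract can be illustrated with a minimal sketch. Everything below is an illustrative assumption rather than the authors' setup: a network with one hidden layer of two logistic units, a single logistic output, squared error on one training example, and hypothetical names (`forward`, `backprop`, `gradient_step`, `W_HIDDEN`, `W_OUT`). The third input component is held at 1 to act as a bias.

```python
import math


def sigmoid(x):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_out, x):
    """Return the hidden activations and the scalar output for one input."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sigmoid(sum(w * hi for w, hi in zip(w_out, h)))
    return h, y


def loss(w_hidden, w_out, x, target):
    """Squared error between actual and desired output."""
    _, y = forward(w_hidden, w_out, x)
    return 0.5 * (y - target) ** 2


def backprop(w_hidden, w_out, x, target):
    """Gradients of the squared error with respect to both weight sets."""
    h, y = forward(w_hidden, w_out, x)
    # Error signal at the output unit (chain rule through the logistic).
    delta_out = (y - target) * y * (1.0 - y)
    grad_out = [delta_out * hi for hi in h]
    # Error signal propagated back to each hidden unit.
    delta_hidden = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(w_out, h)]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out


def gradient_step(w_hidden, w_out, x, target, lr=0.5):
    """One gradient-descent update of all weights; returns new weight sets."""
    grad_hidden, grad_out = backprop(w_hidden, w_out, x, target)
    new_hidden = [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(w_hidden, grad_hidden)
    ]
    new_out = [w - lr * g for w, g in zip(w_out, grad_out)]
    return new_hidden, new_out


# Hypothetical starting weights and a single training example.
W_HIDDEN = [[0.1, -0.2, 0.05], [0.3, 0.4, -0.1]]
W_OUT = [0.2, -0.3]
X = [1.0, 0.0, 1.0]  # last component fixed at 1 serves as a bias input
TARGET = 1.0
```

Repeating `gradient_step` over a set of examples is the "repeatedly adjusts the weights" loop; the sketch omits momentum and output biases for brevity.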
1986-rumelhart.pdf: “Learning Internal Representations by Error Propagation”, (1986; ):
This paper presents a generalization of the perceptron learning procedure for learning the correct sets of connections for arbitrary networks. The rule, called the “generalized delta rule”, is a simple scheme for implementing a gradient descent method for finding weights that minimize the sum squared error of the system’s performance. The major theoretical contribution of the work is the procedure called “error propagation”, whereby the gradient can be determined by individual units of the network based only on locally available information. The major empirical contribution of the work is to show that the problem of local minima is not serious in this application of gradient descent. [Keywords: learning networks perceptrons adaptive systems learning machines, back propagation]
…In their pessimistic discussion of perceptrons, Minsky and Papert (1969) finally discuss multilayer machines near the end of their book. They state:
The perceptron has shown itself worthy of study despite (and even because of!) its severe limitations. It has many features that attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension is sterile. Perhaps some powerful convergence theorem will be discovered, or some profound reason for the failure to produce an interesting “learning theorem” for the multilayered machine will be found. (pp. 231–232)
Although our learning results do not guarantee that we can find a solution for all solvable problems, our analyses and results have shown that as a practical matter, the error propagation scheme leads to solutions in virtually every case. In short, we believe that we have answered Minsky and Papert’s challenge and have found a learning result sufficiently powerful to demonstrate that their pessimism about learning in multilayer machines was misplaced.
One way to view the procedure we have been describing is as a parallel computer that, having been shown the appropriate input/output exemplars specifying some function, programs itself to compute that function in general. Parallel computers are notoriously difficult to program. Here we have a mechanism whereby we do not actually have to know how to write the program in order to get the system to do it. Parker (1985) has emphasized this point. [Learning-logic: casting the cortex of the human brain in silicon, (TR-47). Cambridge, MA: Massachusetts Institute of Technology, Center for Computational Research in Economics and Management Science]
On many occasions we have been surprised to learn of new methods of computing interesting functions by observing the behavior of our learning algorithm. This also raised the question of generalization. In most of the cases presented above, we have presented the system with the entire set of exemplars. It is interesting to ask what would happen if we presented only a subset of the exemplars at training time and then watched the system generalize to remaining exemplars. In small problems such as those presented here, the system sometimes finds solutions to the problems which do not properly generalize. However, preliminary results on larger problems are very encouraging in this regard. This research is still in progress and cannot be reported here. This is currently a very active interest of ours.
Finally, we should say that this work is not yet in a finished form. We have only begun our study of recurrent networks and sigma-pi units. We have not yet applied our learning procedure to many very complex problems. However, the results to date are encouraging and we are continuing our work.
1987-mcdermott.pdf: “A critique of pure reason”, (1987-02-01):
[1987 retrospective by noted proponent of logic for planning and reasoning in AI (‘GOFAI’); McDermott criticizes his own work fiercely, along with that of his colleagues (particularly John McCarthy, Robert Moore, James Allen, Jerry Hobbs, & Patrick Hayes), describing the ‘logicist’ paradigm—that sufficiently ingenious and persistent application of logical reasoning, mostly first-order logic, can eventually give rise to human-level understanding of the world, planning & execution of actions, and eventually AGI.
McDermott concludes that the nature of such programs is that they are unable to see if they are making real progress (because a failure to infer something could simply reflect a lacking axiom), and worse, that such logics are not even an approximation to what intelligence is, or a role model, or that failures reflect poor choice of axioms, but that logics only verify things and do not compute useful things like plans, and collapse into verifying trivialities which do no useful intellectual work. Resorts to powerful tools like temporal logics or nonmonotonic logics sacrifice the philosophical advantages of logical inference in an attempt to get working systems, but may obtain neither. What is necessary is doing without deduction.]
It must be the case that a substantial portion of the inferences we want [to make] are deductions, or it will simply be irrelevant how many theorems follow deductively from a given axiom set.
…To summarize: The logicist project of expressing “naive physics” in first-order logic has not been very successful. One reason may be that the basic argument was flawed. You cannot write down axioms independent of a program for manipulating them if the inferences you are interested in are not deductions. Unfortunately, very few interesting inferences are deductions, and the attempts by logicists to extend logic to cover more territory have been disappointing. Hence we must resign ourselves to writing programs, and viewing knowledge representations as entities to be manipulated by the programs.
…Finally, I should admit that I am still doing work in the paradigm that I criticize here. In the domain of shape representation, so little is known that focusing on an idealization cannot but help teach us something. The problem I would like to tackle is representing the knowledge required to answer questions like, Could a paper clip be used as a key ring? The idealization I have been forced to fall back on is to prove that a paper clip of a certain size and shape could fit through the hole of a typical key. It should be obvious how much of the original problem this leaves out. Still, the territory is so unexplored that a tour through the idealized fragment could turn up something interesting. What one cannot hope for is to express as logical axioms everything there is to know about using shapes in unusual ways, before designing programs for this task. This will probably come as a shock to no one but me and a few friends.
1987-robinson.pdf: “The Utility Driven Dynamic Error Propagation Network”, (1987-11-04; ):
Error propagation networks are able to learn a variety of tasks in which a static input pattern is mapped onto a static output pattern. This paper presents a generalisation of these nets to deal with time varying, or dynamic, patterns. Three possible architectures are explored which deal with learning sequences of known finite length and sequences of unknown and possibly infinite length. Several examples are given and an application to speech coding is discussed.
A further development of dynamic nets is made which allows them to be trained by a signal which expresses the correctness of the output of the net, the utility signal. One possible architecture for such utility driven dynamic nets is given and a simple example is presented. Utility driven dynamic nets are potentially able to calculate and maximize any function of the input and output data streams, within the considered context. This is a very powerful property, and an appendix presents a comparison of the information processing in utility driven dynamic nets and that in the human brain.
1988-bachrach.pdf: “A Sticky-Bit Approach for Learning to Represent State”, Jonathan R. Bachrach ( )
1988-langley.pdf: “Machine learning as an experimental science”, Pat Langley ( )
1988-papert.pdf: “One AI or Many?”, Seymour Papert
1989-cliff.pdf: “In memory of Henry J. Kelley”, E. M. Cliff
1989-williams-2.pdf: “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks”, (1989-06-01; ):
The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks. These algorithms have (1) the advantage that they do not require a precisely defined training interval, operating while the network runs; and (2) the disadvantage that they require nonlocal communication in the network being trained and are computationally expensive. These algorithms allow networks having recurrent connections to learn complex tasks that require the retention of information over time periods having either fixed or indefinite length.
1989-williams.pdf: “Experimental Analysis of the Real-time Recurrent Learning Algorithm”, (1989):
The real-time recurrent learning algorithm (RTRL) is a gradient-following learning algorithm for completely recurrent networks running in continually sampled time. Here we use a series of simulation experiments to investigate the power and properties of this algorithm. In the recurrent networks studied here, any unit can be connected to any other, and any unit can receive external input. These networks run continually in the sense that they sample their inputs on every update cycle, and any unit can have a training target on any cycle. The storage required and computation time on each step are independent of time and are completely determined by the size of the network, so no prior knowledge of the temporal structure of the task being learned is required. The algorithm is nonlocal in the sense that each unit must have knowledge of the complete recurrent weight matrix and error vector. The algorithm is computationally intensive in sequential computers, requiring a storage capacity of the order of the third power of the number of units and a computation time on each cycle of the order of the fourth power of the number of units. The simulations include examples in which networks are taught tasks not possible with tapped delay lines—that is, tasks that require the preservation of state over potentially unbounded periods of time. The most complex example of this kind is learning to emulate a Turing machine that does a parenthesis balancing problem. Examples are also given of networks that do feedforward computations with unknown delays, requiring them to organize into networks with the correct number of layers. Finally, examples are given in which networks are trained to oscillate in various ways, including sinusoidal oscillation.
1990-dreyfus.pdf: “Artificial neural networks, back propagation, and the Kelley-Bryson gradient procedure”, Stuart E. Dreyfus
1990-elman.pdf: “Finding Structure In Time”, (1990-04-01; ):
Time underlies many interesting human behaviors. Thus, the question of how to represent time in connectionist models is very important. One approach is to represent time implicitly by its effects on processing rather than explicitly (as in a spatial representation).
The current report develops a proposal along these lines first described by Jordan 1986 which involves the use of recurrent links in order to provide networks with a dynamic memory. In this approach, hidden unit patterns are fed back to themselves; the internal representations which develop thus reflect task demands in the context of prior internal states.
A set of simulations is reported which range from relatively simple problems (temporal version of XOR) to discovering syntactic/semantic features for words. The networks are able to learn interesting internal representations which incorporate task demands with memory demands; indeed, in this approach the notion of memory is inextricably bound up with task processing. These representations reveal a rich structure, which allows them to be highly context-dependent, while also expressing generalizations across classes of items.
These representations suggest a method for representing lexical categories and the type/token distinction.
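The recurrence described in this abstract, in which hidden unit patterns are fed back to the hidden layer, can be sketched as a forward pass. This is a hedged illustration only: the `tanh` activation, the weight values, and the names `elman_step`, `run_sequence`, `W_IN`, and `W_CTX` are assumptions made for the example, and no training is shown.

```python
import math


def elman_step(w_in, w_ctx, x, context):
    """New hidden state from the current input and the previous hidden state."""
    return [
        math.tanh(
            sum(wi * xi for wi, xi in zip(row_in, x))
            + sum(wc * ci for wc, ci in zip(row_ctx, context))
        )
        for row_in, row_ctx in zip(w_in, w_ctx)
    ]


def run_sequence(w_in, w_ctx, inputs):
    """Feed a sequence through the network, copying hidden state into context."""
    context = [0.0] * len(w_ctx)  # context units start at rest
    states = []
    for x in inputs:
        context = elman_step(w_in, w_ctx, x, context)
        states.append(context)
    return states


# Hypothetical weights: one input unit, two hidden (and two context) units.
W_IN = [[0.5], [-0.3]]
W_CTX = [[0.1, 0.4], [-0.2, 0.3]]

# Both sequences end with the same input but have different histories.
states_a = run_sequence(W_IN, W_CTX, [[1.0], [1.0]])
states_b = run_sequence(W_IN, W_CTX, [[0.0], [1.0]])
```

Because the context carries the previous hidden state, the final hidden states of the two runs differ even though their last inputs are identical, which is the sense in which the representation depends on prior internal states.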
1990-opper.pdf: “On the ability of the optimal perceptron to generalize”, (1990):
A linearly separable Boolean function is derived from a set of examples by a perceptron with optimal stability. The probability to reconstruct a pattern which is not learnt is calculated analytically using the replica method. [see also “double descent”]
1990-schmidhuber.pdf: “A Local Learning Algorithm for Dynamic Feedforward and Recurrent Networks”, (1989; ):
Most known learning algorithms for dynamic neural networks in non-stationary environments need global computations to perform credit assignment. These algorithms either are not local in time or not local in space. Those algorithms which are local in both time and space usually cannot deal sensibly with ‘hidden units’. In contrast, as far as we can judge, learning rules in biological systems with many ‘hidden units’ are local in both space and time. In this paper we propose a parallel on-line learning algorithm which performs local computations only, yet still is designed to deal with hidden units and with units whose past activations are ‘hidden in time’. The approach is inspired by Holland’s idea of the bucket brigade for classifier systems, which is transformed to run on a neural network with fixed topology. The result is a feedforward or recurrent ‘neural’ dissipative system which is consuming ‘weight-substance’ and permanently trying to distribute this substance onto its connections in an appropriate way. Simple experiments demonstrating the feasibility of the algorithm are reported.
1990-schwartz.pdf: “Exhaustive Learning”, (1990-09-01):
Exhaustive exploration of an ensemble of networks is used to model learning and generalization in layered neural networks. A simple Boolean learning problem involving networks with binary weights is numerically solved to obtain the entropy Sm and the average generalization ability Gm as a function of the size m of the training set. Learning curves Gm vs m are shown to depend solely on the distribution of generalization abilities over the ensemble of networks. Such distribution is determined prior to learning, and provides a novel theoretical tool for the prediction of network performance on a specific task.
1991-hochreiter.pdf: “Untersuchungen zu dynamischen neuronalen Netzen [Studies of dynamic neural networks]”, (1991-06-15; ):
[GPT-3 translation of German abstract]
Since the seminal article by Rumelhart, Hinton, and Williams [RHW86], backpropagation (BP) has become very popular as a learning method for neural networks with and without feedback. In contrast to many other learning methods for neural networks, BP takes into account the network structure and improves the network on the basis of this knowledge.
Since a very remote past input has to influence the present output, if it is randomly selected, this input is very unlikely to influence the present state of the network. Hence BP algorithms do not detect the fact that this input is responsible for the output desired. Therefore, it is very hard for BP algorithms to train a network to remember an input until it is needed to produce a later output. Moreover, the public BP algorithms take a very long time to compute.
In many cases, though, one needs an input sequence, as in Mozer [Moz90], which learns to compose music, where musical pieces are repeated and later note pitches are determined by previous note pitches. Steers a vehicle in a labyrinth, and the network obtains the error information only if the vehicle is in a dead end, so back-propagated errors are needed. If a neural network controls a robot that performs a task, perhaps some preparatory tasks are necessary whose performance the system should remember.
In this work, we investigate how to approach the problem of the long learning time associated with network inputs that are used later to control a desired output. This can be done either by means of the network architecture or by using the structure of the input sequences. In Chapter 4, a network is built so that inputs that are received at long delays are considered better than in the usual network architecture. Here, ‘storage nodes’ are introduced, which can carry information about an arbitrarily long time interval. The shortening of the input sequences, while retaining all relevant information, is investigated in Chapter 3. When a shortened input sequence must be recognized within this not so far back into the past, to recognize the relevant inputs. In Chapter 1 the used BP-Learn algorithms are presented, which are then in Chapter 2 analyzed to determine the cause of the long learning time to learn to store past inputs. To the algorithms it should be said that in some cases these were slightly modified to save computational time. The problem of resource acquisition occurring in Chapter 3 and 4 methods is addressed in Chapter 5.
The described experiments were performed on Sparc-based SUN stations. Due to time-resource constraints, algorithm comparison tests could not be carried out in the desired extent. There were trials that ran for up to a week on these machines, but other processes with higher priority were also running on these machines.
The definitions and notations in this work are not those commonly used in studies of neural networks, but they are introduced here only for this work. The reason is that there are no uniform, fundamental definitions for neural networks on which other authors would have based their work. Therefore, it is not guaranteed that there are no inconsistencies in the definitions and notations with other works.
These results were not all mathematically proven, as the work does not claim to be a mathematical analysis of neural networks. It is also difficult to find simple mathematical formalisms for neural networks. The work will rather describe ideas and approaches to see if it is possible to get a better grip on the problem of the long learning time for important previous inputs.
Besides the methods described here for learning in non-static environments, there is also the approach of the “Adaptive Critic”, as described in [Sch90a] and [Sch90c [“Recurrent networks adjusted by adaptive critics”]]. The approach of “fast weights” by Schmidhuber [Sch91b] founds a storage function, although with a completely different approach than in Chapter 4, where a storage is also constructed.
1991-schmidhuber.pdf: “Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks”, (1992; ):
Previous algorithms for supervised sequence learning are based on dynamic recurrent networks. This paper describes an alternative class of gradient-based systems consisting of two feedforward nets that learn to deal with temporal sequences using fast weights: The first net learns to produce context-dependent weight changes for the second net whose weights may vary very quickly. The method offers the potential for STM storage efficiency: A single weight (instead of a full-fledged unit) may be sufficient for storing temporal information. Various learning methods are derived. Two experiments with unknown time delays illustrate the approach. One experiment shows how the system can be used for adaptive temporary variable binding.
1991-sethi-artificialneuralnetworksandstatisticalpatternrecognition.pdf: “Untitled” ( )
1991-shavlik.pdf: “Symbolic and neural learning algorithms: An experimental comparison”, (1991-03-01; ):
Despite the fact that many symbolic and neural network (connectionist) learning algorithms address the same problem of learning from classified examples, very little is known regarding their comparative strengths and weaknesses.
Experiments comparing the ID3 symbolic learning algorithm with the perceptron and backpropagation neural learning algorithms have been performed using 5 large, real-world data sets.
Overall, backpropagation performs slightly better than the other 2 algorithms in terms of classification accuracy on new examples, but takes much longer to train. Experimental results suggest that backpropagation can work statistically-significantly better on data sets containing numerical data.
Also analyzed empirically are the effects of (1) the amount of training data, (2) imperfect training examples, and (3) the encoding of the desired outputs.
Backpropagation occasionally outperforms the other 2 systems when given relatively small amounts of training data. It is slightly more accurate than ID3 when examples are noisy or incompletely specified. Finally, backpropagation more effectively utilizes a “distributed” output encoding. [Keywords: empirical learning, connectionism, neural networks, inductive learning, ID3, perceptron, backpropagation]
1992-levy-artificiallife.pdf: “Artificial Life: A Report from the Frontier Where Computers Meet Biology”, Steven Levy ( )
1993-amari.pdf: “Statistical Theory of Learning Curves under Entropic Loss Criterion”, (1993-01-01):
The present paper elucidates a universal property of learning curves, which shows how the generalization error, training error, and the complexity of the underlying stochastic machine are related and how the behavior of a stochastic machine is improved as the number of training examples increases. The error is measured by the entropic loss. It is proved that the generalization error converges to H0, the entropy of the conditional distribution of the true machine, as H0 + m*/(2t), while the training error converges as H0 − m*/(2t), where t is the number of examples and m* shows the complexity of the network. When the model is faithful, implying that the true machine is in the model, m* is reduced to m, the number of modifiable parameters. This is a universal law because it holds for any regular machine irrespective of its structure under the maximum likelihood estimator. Similar relations are obtained for the Bayes and Gibbs learning algorithms. These learning curves show the relation among the accuracy of learning, the complexity of a model, and the number of training examples.
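The two asymptotic learning curves stated in this abstract can be written out as follows. The symbols for the averaged generalization and training errors are notation assumed here for illustration; only H0, m*, and t come from the abstract.

```latex
\begin{align}
  \langle e_{\mathrm{gen}}(t) \rangle   &\simeq H_0 + \frac{m^*}{2t}, \\
  \langle e_{\mathrm{train}}(t) \rangle &\simeq H_0 - \frac{m^*}{2t}, \\
  \langle e_{\mathrm{gen}}(t) \rangle - \langle e_{\mathrm{train}}(t) \rangle &\simeq \frac{m^*}{t}.
\end{align}
```

The third line is simple subtraction of the first two: under these asymptotics the gap between generalization and training error shrinks as 1/t and, for a faithful model (m* = m), is proportional to the number of modifiable parameters.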
1993-cortes.pdf: “Learning Curves: Asymptotic Values and Rate of Convergence”, (1993; ):
Training classifiers on large databases is computationally demanding. It is desirable to develop efficient procedures for a reliable prediction of a classifier’s suitability for implementing a given task, so that resources can be assigned to the most promising candidates or freed for exploring new classifier candidates.
We propose such a practical and principled predictive method. Practical because it avoids the costly procedure of training poor classifiers on the whole training set, and principled because of its theoretical foundation.
The effectiveness of the proposed procedure is demonstrated for both single- and multi-layer networks.
1993-harth-thecreativeloop.pdf: “Untitled” ( )
1993-watkin.pdf: “The statistical mechanics of learning a rule”, (1993-04-01; ):
A summary is presented of the statistical mechanical theory of learning a rule with a neural network, a rapidly advancing area which is closely related to other inverse problems frequently encountered by physicists. By emphasizing the relationship between neural networks and strongly interacting physical systems, such as spin glasses, the authors show how learning theory has provided a workshop in which to develop new, exact analytical techniques.
1995-breiman.pdf: “Reflections After Refereeing Papers for NIPS”, (1995-01-01; ):
The theoretical work by Wiener and others on the spectral analysis of stationary time series penetrated statistics following Tukey’s heuristic work on estimation of the spectrum. In refereeing papers for NIPS the author was struck by the growing emphasis on mathematical theory. Mathematical theory is not critical to the development of machine learning. In machine learning, the current panacea is a sigmoid network fitted using backpropagation. The pi-method, for approximating functions using noisy data, was suggested by results in mathematical approximation theory. In spite of intense activity, none of the work has had any effect on the day-to-day practice of statistics, or even on present-day theory. The list of useful theories was not meant to be inclusive, but even a more inclusive list would be very short. A possible reason is that it is difficult to formulate reasonable analytic models for complex data.
…Uses Of Theory
- Comfort: We knew it worked, but it’s nice to have a proof.
- Insight: Aha! So that’s why it works.
- Innovation: At last, a mathematically proven idea that applies to data.
- Suggestion: Something like this might work with data.
…Our fields would be better off with far fewer theorems, less emphasis on faddish stuff, and much more scientific inquiry and engineering. But the latter requires real thinking. For instance, there are many important questions regarding neural networks which are largely unanswered. There seem to be conflicting stories regarding the following issues:
- Why don’t heavily parameterized neural networks overfit the data?
- What is the effective number of parameters?
- Why doesn’t backpropagation head for a poor local minima?
- When should one stop the backpropagation and use the current parameters?
It makes research more interesting to know that there is no one universally best method. What is best is data dependent. Sometimes “least glamorous” methods such as nearest neighbor are best. We need to learn more about what works best where. But emphasis on theory often distracts us from doing good engineering and living with the data.
1995-mozer.pdf: “A Focused Backpropagation Algorithm for Temporal Pattern Recognition”, (1995; ):
Time is at the heart of many pattern recognition tasks (e.g., speech recognition). However, connectionist learning algorithms to date are not well-suited for dealing with time-varying input patterns.
This chapter introduces a specialized connectionist architecture and corresponding specialization of the backpropagation learning algorithm that operates efficiently, both in computational time and space requirements, on temporal sequences. The key feature of the architecture is a layer of self-connected hidden units that integrate their current value with the new input at each time step to construct a static representation of the temporal input sequence. This architecture avoids two deficiencies found in the backpropagation unfolding-in-time procedure (Rumelhart, Hinton, & Williams, 1986) for handling sequence recognition tasks: first, it reduces the difficulty of temporal credit assignment by focusing the backpropagated error signal; second, it eliminates the need for a buffer to hold the input sequence and/or intermediate activity levels. The latter property is due to the fact that during the forward (activation) phase, incremental activity traces can be locally computed that hold all information necessary for backpropagation in time.
It is argued that this architecture should scale better than conventional recurrent architectures with respect to sequence length. The architecture has been used to implement a temporal version of Rumelhart and McClelland’s (1986) verb past-tense model. The hidden units learn to behave something like Rumelhart and McClelland’s “Wickelphones,” a rich and flexible representation of temporal information.
1996-haussler.pdf: “Rigorous Learning Curve Bounds from Statistical Mechanics”, (1996-11-01; ):
In this paper we introduce and investigate a mathematically rigorous theory of learning curves that is based on ideas from statistical mechanics.
The advantage of our theory over the well-established Vapnik-Chervonenkis theory is that our bounds can be considerably tighter in many cases, and are also more reflective of the true behavior of learning curves.
This behavior can often exhibit dramatic properties such as phase transitions, as well as power law asymptotics not explained by the VC theory.
The disadvantages of our theory are that its application requires knowledge of the input distribution, and it is limited so far to finite cardinality function classes.
We illustrate our results with many concrete examples of learning curve bounds derived from our theory. [Keywords: learning curves, statistical mechanics, phase transitions, VC dimension]
1996-kohavi.pdf: “Scaling up the accuracy of Naive–Bayes classifiers: a decision–tree hybrid”, (1996-08-01; ):
Naive-Bayes induction algorithms were previously shown to be surprisingly accurate on many classification tasks even when the conditional independence assumption on which they are based is violated. However, most studies were done on small databases.
We show that in some larger databases, the accuracy of Naive-Bayes does not scale up as well as decision trees.
We then propose a new algorithm, NBTree, which induces a hybrid of decision-tree classifiers and Naive-Bayes classifiers: the decision-tree nodes contain univariate splits as regular decision-trees, but the leaves contain Naive-Bayesian classifiers. The approach retains the interpretability of Naive-Bayes and decision trees, while resulting in classifiers that frequently outperform both constituents, especially in the larger databases tested.
1996-opper.pdf: “Statistical Mechanics of Generalization”, (1996):
We estimate a neural network’s ability to generalize from examples using ideas from statistical mechanics. We discuss the connection between this approach and other powerful concepts from mathematical statistics, computer science, and information theory that are useful in explaining the performance of such machines. For the simplest network, the perceptron, we introduce a variety of learning problems that can be treated exactly by the replica method of statistical physics.
1997-domingos.pdf: “On the Optimality of the Simple Bayesian Classifier under Zero-One Loss”, (1997-11-01; ):
The simple Bayesian classifier is known to be optimal when attributes are independent given the class, but the question of whether other sufficient conditions for its optimality exist has so far not been explored.
Empirical results showing that it performs surprisingly well in many domains containing clear attribute dependences suggest that the answer to this question may be positive.
This article shows that, although the Bayesian classifier’s probability estimates are only optimal under quadratic loss if the independence assumption holds, the classifier itself can be optimal under zero-one loss (misclassification rate) even when this assumption is violated by a wide margin. The region of quadratic-loss optimality of the Bayesian classifier is in fact a second-order infinitesimal fraction of the region of zero-one optimality.
This implies that the Bayesian classifier has a much greater range of applicability than previously thought. For example, in this article it is shown to be optimal for learning conjunctions and disjunctions, even though they violate the independence assumption.
Further, studies in artificial domains show that it will often outperform more powerful classifiers for common training set sizes and numbers of attributes, even if its bias is a priori much less appropriate to the domain.
This article’s results also imply that detecting attribute dependence is not necessarily the best way to extend the Bayesian classifier, and this is also verified empirically. [Keywords: Simple Bayesian classifier, naive Bayesian classifier, zero-one loss, optimal classification, induction with attribute dependences]
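The distinction between the two loss functions can be made concrete with a small worked example. The sketch below is illustrative only: the posterior values are invented for the example, and `zero_one_loss` and `quadratic_loss` are hypothetical helper names rather than anything defined in the article. It shows a badly miscalibrated probability estimate that still yields the optimal class decision.

```python
def zero_one_loss(p_true: float, p_est: float) -> float:
    """Expected misclassification rate on a two-class problem when class 1
    is predicted whenever the estimated posterior p_est exceeds 0.5 and the
    true posterior of class 1 is p_true."""
    predict_one = p_est > 0.5
    return 1.0 - p_true if predict_one else p_true


def quadratic_loss(p_true: float, p_est: float) -> float:
    """Squared error of the probability estimate itself."""
    return (p_true - p_est) ** 2


# Assumed numbers: the true P(class=1 | x) is 0.6, the estimate is an
# overconfident 0.95 (as can happen when dependent attributes are
# treated as independent evidence).
p_true, p_est = 0.6, 0.95

optimal_error = zero_one_loss(p_true, p_true)   # decision from the true posterior
actual_error = zero_one_loss(p_true, p_est)     # decision from the poor estimate
estimate_error = quadratic_loss(p_true, p_est)  # nonzero: the estimate is far off
```

Both decisions pick class 1, so `actual_error` equals `optimal_error` even though `estimate_error` is large: the estimate only has to land on the correct side of 0.5.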
1997-hochreiter-2.pdf: “Flat Minima”, (1997-01-01; ):
We present a new algorithm for finding low-complexity neural networks with high generalization capability. The algorithm searches for a “flat” minimum of the error function. A flat minimum is a large connected region in weight space where the error remains approximately constant. An MDL-based, Bayesian argument suggests that flat minima correspond to “simple” networks and low expected overfitting. The argument is based on a Gibbs algorithm variant and a novel way of splitting generalization error into underfitting and overfitting error. Unlike many previous approaches, ours does not require gaussian assumptions and does not depend on a “good” weight prior. Instead we have a prior over input output functions, thus taking into account net architecture and training set. Although our algorithm requires the computation of second-order derivatives, it has backpropagation’s order of complexity. Automatically, it effectively prunes units, weights, and input lines. Various experiments with feedforward and recurrent nets are described. In an application to stock market prediction, flat minimum search outperforms conventional backprop, weight decay, and “optimal brain surgeon/optimal brain damage.”
1997-hochreiter.pdf: “Long Short-Term Memory”, (1997-12-15; ):
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter’s (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM).
Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is 𝒪(1).
Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
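The "constant error carousel" idea can be sketched with a single scalar memory cell. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, `tanh` is used as the squashing function for convenience, and the gate pre-activations are supplied by hand instead of being computed from learned weights.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def memory_cell_step(state, cell_in, in_gate_net, out_gate_net):
    """One time step of a single memory cell.

    The internal state feeds back into itself with a fixed weight of 1.0,
    so d(new_state)/d(state) = 1 and error flowing back through the state
    neither decays nor grows. Only the input gate can change the state.
    """
    in_gate = sigmoid(in_gate_net)    # multiplicative gate in (0, 1)
    out_gate = sigmoid(out_gate_net)  # controls what the rest of the net sees
    new_state = state + in_gate * math.tanh(cell_in)
    output = out_gate * math.tanh(new_state)
    return new_state, output


def run(steps: int = 1000):
    # Step 0: input gate open, so the cell stores tanh(1.0).
    state, _ = memory_cell_step(0.0, cell_in=1.0, in_gate_net=30.0, out_gate_net=-30.0)
    stored = state
    # Later steps: input keeps arriving, but the input gate is almost closed.
    for _ in range(steps):
        state, _ = memory_cell_step(state, cell_in=1.0, in_gate_net=-30.0, out_gate_net=-30.0)
    return stored, state


stored, final = run()
```

After 1000 steps with the input gate shut, `final` differs from `stored` only by a negligible amount, which is the forward-pass counterpart of the constant backward error flow described in the abstract.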
1997-oates.pdf: “The Effects of Training Set Size on Decision Tree Complexity”, (1997; ):
This paper presents experiments with 19 datasets and 5 decision tree pruning algorithms that show that increasing training set size often results in a linear increase in tree size, even when that additional complexity results in no substantial increase in classification accuracy. Said differently, removing randomly selected training instances often results in trees that are substantially smaller and just as accurate as those built on all available training instances.
This implies that decreases in tree size obtained by more sophisticated data reduction techniques should be decomposed into 2 parts: that which is due to reduction of training set size, and the remainder, which is due to how the method selects instances to discard.
We perform this decomposition for one recent data reduction technique, John’s ROBUSTC4.5 (John 1995), and show that a large percentage of its effect on tree size is attributable to the fact that it simply reduces the size of the training set.
We conclude that random data reduction is a baseline against which more sophisticated data reduction techniques should be compared. Finally, we examine one possible cause of the pathological relationship between tree size and training set size.
1999-brain.pdf: “On The Effect of Data Set Size on Bias And Variance in Classification Learning”, (1999; ):
With the advent of data mining, machine learning has come of age and is now a critical technology in many businesses. However, machine learning evolved in a different research context to that in which it now finds itself employed. A particularly important problem in the data mining world is working effectively with large data sets. However, most machine learning research has been conducted in the context of learning from very small data sets.
To date most approaches to scaling up machine learning to large data sets have attempted to modify existing algorithms to deal with large data sets in a more computationally efficient and effective manner. But is this necessarily the best method?
This paper explores the possibility of designing algorithms specifically for large data sets. Specifically, the paper looks at how increasing data set size affects bias and variance error decompositions for classification algorithms.
Preliminary results of experiments to determine these effects are presented, showing that, as hypothesized, variance can be expected to decrease as training set size increases. No clear effect of training set size on bias was observed.
These results have profound implications for data mining from large data sets, indicating that developing effective learning algorithms for large data sets is not simply a matter of finding computationally efficient variants of existing learning algorithms.
1999-provost-2.pdf: “Efficient Progressive Sampling”, (1999-08-01; ):
Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However, the correct sample size rarely is obvious.
We analyze methods for progressive sampling—using progressively larger samples as long as model accuracy improves. We explore several notions of efficient progressive sampling.
We analyze efficiency relative to induction with all instances; we show that a simple, geometric sampling schedule is asymptotically optimal, and we describe how best to take into account prior expectations of accuracy convergence.
We then describe the issues involved in instantiating an efficient progressive sampler, including how to detect convergence. Finally, we provide empirical results comparing a variety of progressive sampling methods. We conclude that progressive sampling can be remarkably efficient.
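A geometric schedule of the kind described above can be sketched in a few lines. The code is a rough illustration under assumptions: `geometric_schedule` and `progressive_sample` are hypothetical names, `score_at` stands in for "train a model on n instances and return its accuracy", and the stopping rule is a naive fixed-gain threshold, not the convergence-detection procedure the paper discusses.

```python
from typing import Callable, List, Tuple


def geometric_schedule(n0: int, factor: float, n_max: int) -> List[int]:
    """Sample sizes n0, n0*factor, n0*factor^2, ... capped at n_max."""
    if n0 < 1 or factor <= 1.0:
        raise ValueError("need n0 >= 1 and factor > 1")
    sizes = []
    n = n0
    while n < n_max:
        sizes.append(n)
        n = max(n + 1, int(n * factor))  # always make progress
    sizes.append(n_max)
    return sizes


def progressive_sample(
    score_at: Callable[[int], float],
    sizes: List[int],
    min_gain: float = 0.001,
) -> Tuple[int, float]:
    """Walk the schedule until accuracy improves by less than min_gain.

    Returns the last sample size that still gave a worthwhile gain,
    together with its accuracy.
    """
    best_n, best_acc = sizes[0], score_at(sizes[0])
    for n in sizes[1:]:
        acc = score_at(n)
        if acc - best_acc < min_gain:
            break
        best_n, best_acc = n, acc
    return best_n, best_acc
```

For example, `geometric_schedule(100, 2, 1000)` yields `[100, 200, 400, 800, 1000]`. Because each sample is a constant factor larger than the last, the total work is dominated by the final sample, which is the intuition behind the efficiency claim.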
1999-provost.pdf: “A Survey of Methods for Scaling Up Inductive Algorithms”, (1999-06-01; ):
One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work on scaling up inductive algorithms.
We concentrate on algorithms that build decision trees and rule sets, in order to provide focus and specific details; the issues and techniques generalize to other types of data mining.
We begin with a discussion of important issues related to scaling up. We highlight similarities among scaling techniques by categorizing them into 3 main approaches. For each approach, we then describe, compare, and contrast the different constituent techniques, drawing on specific examples from published papers.
Finally, we use the preceding analysis to suggest how to proceed when dealing with a large problem, and where to focus future research.
2001-banko.pdf#microsoft: “Scaling to Very Very Large Corpora for Natural Language Disambiguation”, (2001-07-01; ):
The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
…We collected a 1-billion-word training corpus from a variety of English texts, including news articles, scientific abstracts, government transcripts, literature and other varied forms of prose. This training corpus is three orders of magnitude greater than the largest training corpus previously used for this problem. We used 1 million words of Wall Street Journal text as our test set, and no data from the Wall Street Journal was used when constructing the training corpus. Each learner was trained at several cutoff points in the training corpus, i.e. the first one million words, the first five million words, and so on, until all one billion words were used for training. In order to avoid training biases that may result from merely concatenating the different data sources to form a larger training corpus, we constructed each consecutive training corpus by probabilistically sampling sentences from the different sources weighted by the size of each source.
In Figure 1, we show learning curves for each learner, up to one billion words of training data. Each point in the graph is the average performance over ten confusion sets for that size training corpus. Note that the curves appear to be log-linear even out to one billion words.
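One way to check whether a learning curve is log-linear is to fit accuracy against the logarithm of corpus size and inspect the residuals. The helper below is a minimal sketch with a hypothetical name (`fit_log_linear`); it contains no data from the paper, and the caller supplies the measured points.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(sizes: Sequence[float], accuracies: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of accuracy = a + b * log10(size).

    Returns (a, b); b is the accuracy gained per tenfold increase in data.
    """
    xs = [math.log10(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(accuracies) / len(accuracies)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b
```

A curve that is log-linear in this sense gains a fixed increment of accuracy for every tenfold increase in training words, so it shows no sign of flattening within the measured range.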
2001-ng.pdf: “On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes”, (2001; ):
We compare discriminative and generative learning as typified by logistic regression and naive Bayes.
We show, contrary to a widely-held belief that discriminative classifiers are almost always to be preferred, that there can often be 2 distinct regimes of performance as the training set size is increased, one in which each algorithm does better.
This stems from the observation—which is borne out in repeated experiments—that while discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster.
2002-behr.pdf: “Estimating and Comparing Entropy across Written Natural Languages Using PPM Compression”, (2002; ):
Previous work on estimating the entropy of written natural language has focused primarily on English. We expand this work by considering other natural languages, including Arabic, Chinese, French, Greek, Japanese, Korean, Russian, and Spanish. We present the results of PPM compression on machine-generated and human-generated translations of texts into various languages. Under the assumption that languages are equally expressive, and that PPM compression does well across languages, one would expect that translated documents would compress to approximately the same size. We verify this empirically on a novel corpus of translated documents. We suggest as an application of this finding using the size of compressed natural language texts as a means of automatically testing translation quality.
2002-bixby.pdf: “Solving Real-World Linear Programs: A Decade and More of Progress”, (2002-02-01; ):
This paper is an invited contribution to the 50th anniversary issue of the journal Operations Research, published by the Institute of Operations Research and Management Science (INFORMS). It describes one person’s perspective on the development of computational tools for linear programming. The paper begins with a short personal history, followed by historical remarks covering the some 40 years of linear-programming developments that predate my own involvement in this subject. It concludes with a more detailed look at the evolution of computational linear programming since 1987.
…In this paper I have focused primarily on one issue, solving larger, more difficult linear programs faster. The numbers presented speak for themselves. 3 orders of magnitude in machine speed and 3 orders of magnitude in algorithmic speed add up to six orders of magnitude in solving power: A model that might have taken a year to solve 10 years ago can now solve in less than 30 seconds. Of course, no one waits 1 year to solve a model, at least no one I know. The real meaning of such an advance is much harder to measure in practice, but it is real nevertheless. There is no doubt that we now have optimization engines at our disposal that dwarf what was available only a few years ago, making possible the solution of real-world models once considered intractable, and opening up whole new domains of application.
How do these speed improvements fit into the overall picture of linear-programming practice? They are only a part of that picture, though an essential, enabling part. The pervasive availability of powerful, usable desktop computing, the availability of data to feed our models, and the emergence of algebraic modeling languages to represent our models have all combined with the underlying engines to make operations research and linear programming the powerful tools they are today. However, there are still important issues to be solved. In spite of all the advances, the application of linear programming remains primarily the domain of experts. The need for abstraction still stands as a hurdle between technology and solutions. While the existence of this hurdle is disconcerting, it is at least gratifying to know that the benefits from overcoming it are now greater than ever.
2002-hedberg.pdf: “DART: Revolutionizing logistics planning”, Sara Reese Hedberg ( )
2003-perlich.pdf: “Tree Induction vs. Logistic Regression: A Learning–Curve Analysis”, (2003-06-01; ):
We present a large-scale experimental comparison of logistic regression and tree induction (C4.5), assessing classification accuracy and the quality of rankings based on class-membership probabilities.
We use a learning-curve analysis to examine the relationship of these measures to the size of the training set.
The results of the study show several things:
- Contrary to some prior observations, logistic regression does not generally outperform tree induction.
- More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (that is, the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves.
- Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications.
- Finally, the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of the separability of signal from noise. [Keywords: decision trees, learning curves, logistic regression, ROC analysis, tree induction]
…The average data-set size is larger than is usual in machine-learning research, and we see behavioral characteristics that would be overlooked when comparing algorithms only on smaller data sets (such as most in the UCI repository; see Blake & Merz 2000).
…Papers such as this seldom consider carefully the size of the data sets to which the algorithms are being applied. Does the relative performance of the different learning methods depend on the size of the data set?
More than a decade ago in machine learning research, the examination of learning curves was commonplace (see, for example, Kibler & Langley 1988), but usually on single data sets (notable exceptions being the study by Shavlik et al 1991, and the work of Catlett 1991 [“Megainduction: machine learning on very large databases”]). Now learning curves are presented only rarely in comparisons of learning algorithms. Learning curves also are found in the statistical literature (Flury & Schmid 1994) and in the neural network literature (Cortes et al 1994). They have been analyzed theoretically, using statistical mechanics (Watkin et al 1993; Haussler et al 1996).
The few cases that exist draw conflicting conclusions, with respect to our goals. Domingos & Pazzani 1997 compare classification-accuracy learning curves of naive Bayes and the C4.5RULES rule learner (Quinlan 1993). On synthetic data, they show that naive Bayes performs better for smaller training sets and C4.5RULES performs better for larger training sets (the learning curves cross). They discuss that this can be explained by considering the different bias/variance profile of the algorithms for classification (zero/one loss). Roughly speaking, variance plays a more critical role than estimation bias when considering classification accuracy. For smaller data sets, naive Bayes has a substantial advantage over tree or rule induction in terms of variance. They show that this is the case even when (by their construction) the rule learning algorithm has no bias. As expected, as larger training sets reduce variance, C4.5RULES approaches perfect classification. Brain & Webb 1999 perform a similar bias/variance analysis of C4.5 and naive Bayes. They do not examine whether the curves cross, but do show on 4 UCI data sets that variance is reduced consistently with more data, but bias is not. These results do not directly examine logistic regression, but the bias/variance arguments do apply: logistic regression, a linear model, should have higher bias but lower variance than tree induction. Therefore, one would expect that their learning curves might cross.
However, the results of Domingos & Pazzani 1997 were generated from synthetic data where the rule learner had no bias. Would we see such behavior on real-world domains? Kohavi 1996 shows classification-accuracy learning curves of tree induction (using C4.5) and of naive Bayes for 9 UCI data sets. With only one exception, either naive Bayes or tree induction dominates (that is, the performance of one or the other is superior consistently for all training-set sizes). Furthermore, by examining the curves, Kohavi concludes that “In most cases, it is clear that even with much more data, the learning curves will not cross” (pp. 203–204).
We are aware of only one learning-curve analysis that compares logistic regression and tree induction. Harris-Jones & Haines 1997 [“Sample size and misclassification: is more always better?”] compare them on 2 business data sets, one real and one synthetic. For these data the learning curves cross, suggesting (as they observe) that logistic regression is preferable for smaller data sets and tree induction for larger data sets. Our results generally support this conclusion.
…These results concur with recent results (Ng & Jordan 2001) comparing discriminative and generative versions of the same model (viz., logistic regression and naive Bayes), which show that learning curves often cross…A corollary observation is that even for very large data-set sizes, the slope of the learning curves remains distinguishable from zero. Catlett 1991 concluded that learning curves continue to grow, on several large-at-the-time data sets (the largest with fewer than 100,000 training examples). Provost & Kolluri 1999 suggest that this conclusion should be revisited as the size of data sets that can be processed (feasibly) by learning algorithms increases. Our results provide a contemporary reiteration of Catlett’s. On the other hand, our results seemingly contradict conclusions or assumptions made in some prior work. For example, Oates & Jensen 1997 conclude that classification-tree learning curves level off, and Provost et al 1999 replicate this finding and use it as an assumption of their sampling strategy. Technically, the criterion for a curve to have reached a plateau in these studies is that there be less than a certain threshold (<1%) increase in accuracy from the accuracy with the largest data-set size; however, the conclusion often is taken to mean that increases in accuracy cease. Our results show clearly that this latter interpretation is not appropriate even for our largest data-set sizes.
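The plateau criterion described in the excerpt can be written down directly, which also makes its limitation visible. The function below is a sketch with a hypothetical name (`plateau_size`), and it assumes the 1% threshold means absolute accuracy points; the excerpt does not say whether the threshold is absolute or relative.

```python
from typing import List, Tuple


def plateau_size(curve: List[Tuple[int, float]], threshold: float = 0.01) -> int:
    """Smallest training-set size whose accuracy is within `threshold`
    of the accuracy obtained at the largest size.

    `curve` is a list of (training_set_size, accuracy) pairs sorted by size.
    """
    final_accuracy = curve[-1][1]
    for size, accuracy in curve:
        if final_accuracy - accuracy < threshold:
            return size
    return curve[-1][0]
```

A curve can satisfy this test while its accuracy is still rising at every later point, only by less than the threshold in total. The criterion therefore bounds the remaining gain within the observed range and says nothing about whether the slope has reached zero.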
2003-simard.pdf#microsoft: “Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis”, (2003; ):
Neural networks are a powerful technology for classification of visual inputs arising from documents. However, there is a confusing plethora of different neural network methods that are used in the literature and in industry.
This paper describes a set of concrete best practices that document analysis researchers can use to get good results with neural networks.
The most important practice is getting a training set as large as possible: we expand the training set by adding a new form of distorted data.
The next most important practice is that convolutional neural networks are better suited for visual document tasks than fully connected networks. We propose that a simple “do-it-yourself” implementation of convolution with a flexible architecture is suitable for many visual document problems. This simple convolutional neural network does not require complex methods, such as momentum, weight decay, structure-dependent learning rates, averaging layers, tangent prop, or even finely-tuning the architecture.
The end result is a very simple yet general architecture which can yield state-of-the-art performance for document analysis.
We illustrate our claims on the MNIST set of English digit images.
2006-chatelain.pdf: “Extraction de séquences numériques dans des documents manuscrits quelconques”, (2006-12-05; ):
Within the framework of the automatic processing of incoming mail documents, we present in this thesis the conception and development of a numerical field extraction system in weakly constrained handwritten documents.
Although the recognition of isolated handwritten entities can be considered as a partially solved problem, the extraction of information in images of complex and free-layout documents is still a challenge. This problem requires the implementation of both handwriting recognition and information extraction methods inspired by approaches developed within the field of information extraction in electronic documents.
Our contribution consists of the design and implementation of 2 different strategies: the first extends classical handwriting recognition methods, while the second is inspired by approaches used within the field of information extraction in electronic documents.
The results obtained on a real handwritten mail database show that our second approach is substantially better.
Finally, a complete, generic and efficient system is produced, answering one of the emergent perspectives in the field of the automatic reading of handwritten documents: the extraction of complex information in images of documents. [Text of paper is in French.]
2006-friston.pdf: “A free energy principle for the brain”, (2006-07-01; ):
By formulating Helmholtz’s ideas about perception, in terms of modern-day theories, one arrives at a model of perceptual inference and learning that can explain a remarkable range of neurobiological facts: using constructs from statistical physics, the problems of inferring the causes of sensory input and learning the causal structure of their generation can be resolved using exactly the same principles. Furthermore, inference and learning can proceed in a biologically plausible fashion. The ensuing scheme rests on Empirical Bayes and hierarchical models of how sensory input is caused. The use of hierarchical models enables the brain to construct prior expectations in a dynamic and context-sensitive fashion. This scheme provides a principled way to understand many aspects of cortical organisation and responses.
In this paper, we show these perceptual processes are just one aspect of emergent behaviours of systems that conform to a free energy principle. The free energy considered here measures the difference between the probability distribution of environmental quantities that act on the system and an arbitrary distribution encoded by its configuration. The system can minimise free energy by changing its configuration to affect the way it samples the environment or change the distribution it encodes. These changes correspond to action and perception respectively and lead to an adaptive exchange with the environment that is characteristic of biological systems. This treatment assumes that the system’s state and structure encode an implicit and probabilistic model of the environment. We will look at the models entailed by the brain and how minimisation of its free energy can explain its dynamics and structure. [Keywords: Variational Bayes, free energy, inference, perception, action, learning, attention, selection, hierarchical]
2007-brants.pdf#google: “Large Language Models in Machine Translation”, (2007-06; ):
This paper reports on the benefits of large-scale statistical language modeling in machine translation. A distributed infrastructure is proposed which we use to train on up to 2 trillion tokens, resulting in language models having up to 300 billion n-grams. It is capable of providing smoothed probabilities for fast, single-pass decoding. We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large datasets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases.
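The backoff scheme is simple enough to sketch in full. The code below is an illustrative single-machine toy, not the distributed system the abstract describes: the function names are hypothetical, and the default backoff factor `alpha=0.4` is an assumption of this sketch and should be checked against the paper. The scores are relative frequencies with a fixed penalty per backoff step, so they do not sum to one and are not probabilities.

```python
from collections import Counter
from typing import Sequence


def count_ngrams(tokens: Sequence[str], max_order: int) -> Counter:
    """Count every n-gram of order 1..max_order, keyed by tuple of words."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(word, context, counts, total_tokens, alpha=0.4):
    """Score S(word | context) using raw relative frequencies.

    If the full n-gram was seen, return count(context + word) / count(context).
    Otherwise drop the oldest context word, multiply by alpha, and retry;
    with no context left, fall back to the unigram frequency.
    """
    context = tuple(context)
    penalty = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return penalty * counts[ngram] / counts[context]
        context = context[1:]  # back off to a shorter context
        penalty *= alpha
    return penalty * counts[(word,)] / total_tokens


tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, max_order=3)
seen_bigram = stupid_backoff("cat", ("the",), counts, len(tokens))
backed_off_once = stupid_backoff("mat", ("sat", "the"), counts, len(tokens))
backed_off_twice = stupid_backoff("sat", ("on", "the"), counts, len(tokens))
```

No discounting or normalization constants need to be estimated, which is why the method is cheap to train: it only needs n-gram counts.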
2007-elson.pdf#microsoft: “Asirra: a CAPTCHA that exploits interest–aligned manual image categorization”, (2007-10-01; ):
We present Asirra (Figure 1), a CAPTCHA that asks users to identify cats out of a set of 12 photographs of both cats and dogs.
Asirra is easy for users; user studies indicate it can be solved by humans 99.6% of the time in under 30 seconds. Barring a major advance in machine vision, we expect computers will have no better than a 1/54,000 chance of solving it. Asirra’s image database is provided by a novel, mutually beneficial partnership with Petfinder.com. In exchange for the use of their 3 million images, we display an “adopt me” link beneath each one, promoting Petfinder’s primary mission of finding homes for homeless animals.
We describe the design of Asirra, discuss threats to its security, and report early deployment experiences. We also describe 2 novel algorithms for amplifying the skill gap between humans and computers that can be used on many existing CAPTCHAs.
2008-golle.pdf: “Machine learning attacks against the Asirra CAPTCHA”, (2008-10-01; ):
The Asirra CAPTCHA [EDHS2007], proposed at ACM CCS 2007, relies on the problem of distinguishing images of cats and dogs (a task that humans are very good at). The security of Asirra is based on the presumed difficulty of classifying these images automatically.
In this paper, we describe a classifier which is 82.7% accurate in telling apart the images of cats and dogs used in Asirra. This classifier is a combination of support-vector machine classifiers trained on color and texture features extracted from images. Our classifier allows us to solve a 12-image Asirra challenge automatically with probability 10.3%. This probability of success is statistically-significantly higher than the estimate of 0.2% given in [EDHS2007] for machine vision attacks. Our results suggest caution against deploying Asirra without safeguards.
We also investigate the impact of our attacks on the partial credit and token bucket algorithms proposed in [EDHS2007]. The partial credit algorithm weakens Asirra considerably and we recommend against its use. The token bucket algorithm helps mitigate the impact of our attacks and allows Asirra to be deployed in a way that maintains an appealing balance between usability and security. One contribution of our work is to inform the choice of safeguard parameters in Asirra deployments. [Keywords: CAPTCHA, reverse Turing test, machine learning, support vector machine, classifier.]
…Our classifier is a combination of 2 support-vector machine (SVM) classifiers trained on color and texture features of images. The classifier is entirely automatic, and requires no manual input other than the one-time labelling of training images. Using 15,760 color features, and 5,000 texture features per image, our classifier is 82.7% accurate. The classifier was trained on a commodity PC, using 13,000 labeled images of cats and dogs downloaded from the Asirra website.
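The step from 82.7% per-image accuracy to roughly 10% per 12-image challenge can be checked with a short calculation. This is a sketch assuming each image is classified independently, which is a simplifying assumption and not something the abstract states; the function name is hypothetical.

```python
def challenge_success(per_image_accuracy, images_per_challenge=12):
    """Probability of labelling every image in one challenge correctly,
    assuming independent per-image errors."""
    return per_image_accuracy ** images_per_challenge


svm_attack = challenge_success(0.827)   # roughly 0.10
random_guess = challenge_success(0.5)   # 1/4096
```

The independence estimate lands close to, but not exactly on, the reported 10.3%, as expected when the quoted accuracy is rounded.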
2008-omohundro.pdf: “The Basic AI Drives”, (2008-06-01; ):
One might imagine that AI systems with harmless goals will be harmless. This paper instead shows that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways.
We identify a number of “drives” that will appear in sufficiently advanced AI systems of any design. We call them drives because they are tendencies which will be present unless explicitly counteracted.
We start by showing that goal-seeking systems will have drives to model their own operation and to improve themselves.
We then show that self-improving systems will be driven to clarify their goals and represent them as economic utility functions. They will also strive for their actions to approximate rational economic behavior. This will lead almost all systems to protect their utility functions from modification and their utility measurement systems from corruption. We also discuss some exceptional systems which will want to modify their utility functions.
We next discuss the drive toward self-protection which causes systems to try to prevent themselves from being harmed. Finally we examine drives toward the acquisition of resources and toward their efficient utilization.
We end with a discussion of how to incorporate these insights in designing intelligent technology which will lead to a positive future for humanity.
2009-halevy.pdf: “The Unreasonable Effectiveness of Data”, (2009-03-24; ):
At Brown University, there was excitement about having access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it’s taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It’s not annotated with carefully hand-corrected part-of-speech tags. But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions—captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks—if only we knew how to extract the model from the data.
…For many tasks, words and word combinations provide all the representational machinery we need to learn from text.
…So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.
2012-bottou.pdf: “The Tradeoffs of Large-Scale Learning”, (2007/
This chapter develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation-estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithm in non-trivial ways. For instance, a mediocre optimization algorithm, stochastic gradient descent, is shown to perform very well on large-scale learning problems.
…This chapter develops the ideas initially proposed by Bottou & Bousquet 2008 [“The tradeoffs of large scale learning”, NIPS 2007]. Section 13.2 proposes a decomposition of the test error where an additional term represents the impact of approximate optimization. In the case of small-scale learning problems, this decomposition reduces to the well-known tradeoff between approximation error and estimation error. In the case of large-scale learning problems, the tradeoff is more complex because it involves the computational complexity of the learning algorithm. Section 13.3 explores the asymptotic properties of the large-scale learning tradeoff for various prototypical learning algorithms under various assumptions regarding the statistical estimation rates associated with the chosen objective functions. This part clearly shows that the best optimization algorithms are not necessarily the best learning algorithms. Maybe more surprisingly, certain algorithms perform well regardless of the assumed rate of the statistical estimation error. Section 13.4 reports experimental results supporting this analysis.
…These results clearly show that the generalization performance of large-scale learning systems depends on both the statistical properties of the objective function and the computational properties of the chosen optimization algorithm. Their combination leads to surprising consequences:
- The SGD and 2SGD results do not depend on the estimation rate α. When the estimation rate is poor, there is less need to optimize accurately. That leaves time to process more examples. A potentially more useful interpretation leverages the fact that (13.11) is already a kind of generalization bound: its fast rate trumps the slower rate assumed for the estimation error.
- Second-order algorithms bring few asymptotical improvements in ε. Although the superlinear 2GD algorithm improves the logarithmic term, all 4 algorithms are dominated by the polynomial term in (1⁄ε). However, there are important variations in the influence of the constants d, κ, and ν. These constants are very important in practice.
- Stochastic algorithms (SGD, 2SGD) yield the best generalization performance despite showing the worst optimization performance on the empirical cost. This phenomenon has already been described and observed in experiments (eg Bottou & Le Cun 2004).
In contrast, since the optimization error εopt of small-scale learning systems can be reduced to insignificant levels, their generalization performance is determined solely by the statistical properties of the objective function.
…Figure 13.1 shows how much time each algorithm takes to reach a given optimization accuracy. The superlinear algorithm TRON reaches the optimum with 10 digits of accuracy in less than one minute. The stochastic gradient starts more quickly but is unable to deliver such a high accuracy. The upper part of the figure clearly shows that the testing set loss stops decreasing long before the superlinear algorithm overcomes the SGD algorithm.
Figure 13.2 shows how the testing loss evolves with the training time. The stochastic gradient descent curve can be compared with the curves obtained using conjugate gradients on subsets of the training examples with increasing sizes. Assume, for instance, that our computing time budget is 1 second. Running the conjugate gradient algorithm on a random subset of 30,000 training examples achieves a much better performance than running it on the whole training set. How to guess the right subset size a priori remains unclear. Meanwhile, running the SGD algorithm on the full training set reaches the same testing set performance much faster.
…Conclusion: Taking into account budget constraints on both the number of examples and the computation time, we find qualitative differences between the generalization performance of small-scale learning systems and large-scale learning systems. The generalization properties of large-scale learning systems depend on both the statistical properties of the objective function and the computational properties of the optimization algorithm. We illustrate this fact with some asymptotic results on gradient algorithms.
This framework leaves room for considerable refinements. Shalev-Shwartz & Srebro 2008 rigorously extend the analysis to regularized risk formulations with linear parameterization and find again that, for learning purposes, SGD algorithms are often more attractive than standard primal or dual algorithms with good optimization complexity (Joachims 2006; Hush et al 2006). It could also be interesting to investigate how the choice of a surrogate loss function (Zhang 2004; Bartlett et al 2006) impacts the large-scale case.
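For readers unfamiliar with the algorithm being compared, here is a toy sketch of stochastic gradient descent on a one-dimensional least-squares problem. It is not the chapter's experimental setup: the model, learning rate, epoch count, and function name are all assumptions chosen for illustration.

```python
import random


def sgd_linear_regression(data, lr=0.05, epochs=20, seed=0):
    """Fit y ≈ w*x + b by updating on one example at a time."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit examples in a fresh random order
        for x, y in data:
            err = (w * x + b) - y  # residual for this single example
            w -= lr * err * x      # gradient of 0.5*err**2 w.r.t. w
            b -= lr * err          # gradient of 0.5*err**2 w.r.t. b
    return w, b


# Noise-free toy data on the line y = 2x + 1 (assumed for the example).
points = [(i / 25 - 1, 2 * (i / 25 - 1) + 1) for i in range(51)]
w, b = sgd_linear_regression(points)
```

Each update touches a single example, so its cost does not grow with the size of the training set; that property is what lets SGD process more examples within a fixed time budget.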
2012-ciresan.pdf: “Deep Big Multilayer Perceptrons for Digit Recognition”, (2012; ):
The competitive MNIST handwritten digit recognition benchmark has a long history of broken records since 1998. The most recent advancement by others dates back 8 years (error rate 0.4%).
Good old on-line backpropagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the MNIST handwritten digits benchmark with a single MLP, and 0.31% with a committee of 7 MLPs.
All we need to achieve this until-2011-best-result are many hidden layers, many neurons per layer, numerous deformed training images to avoid overfitting, and graphics cards to greatly speed up learning. [Keywords: neural network, multilayer perceptron, GPU, training set deformations, MNIST, committee, backpropagation]
Note: This work combines 3 previously published papers [1,2,3].
…In recent decades the amount of raw computing power per Euro has grown by a factor of 100–1000 per decade. Our results show that this ongoing hardware progress may be more important than advances in algorithms and software (although the future will belong to methods combining the best of both worlds). Current graphics cards (GPUs) are already more than 50 times faster than standard microprocessors when it comes to training big and deep neural networks by the ancient algorithm, online backpropagation (weight update rate up to 7.5×10⁹/s, and more than 10¹⁵ per trained network). On the competitive MNIST handwriting benchmark, single precision floating-point GPU-based neural nets surpass all previously reported results, including those obtained by much more complex methods involving specialized architectures, unsupervised pre-training, combinations of machine learning classifiers etc. Training sets of sufficient size to avoid overfitting are obtained by appropriately deforming images. Of course, the approach is not limited to handwriting, and obviously holds great promise for many visual and other pattern recognition problems.
2012-hayworth.pdf: “ELECTRON IMAGING TECHNOLOGY FOR WHOLE BRAIN NEURAL CIRCUIT MAPPING”, KENNETH J. HAYWORTH ( )
2012-rintanen.pdf: “Planning as satisfiability: Heuristics”, (2012-12; ):
Reduction to SAT is a very successful approach to solving hard combinatorial problems in Artificial Intelligence and computer science in general. Most commonly, problem instances reduced to SAT are solved with a general-purpose SAT solver. Although there is the obvious possibility of improving the SAT solving process with application-specific heuristics, this has rarely been done successfully.
In this work we propose a planning-specific variable selection strategy for SAT solving. The strategy is based on generic principles about properties of plans, and its performance with standard planning benchmarks often substantially improves on generic variable selection heuristics, such as VSIDS, and often lifts it to the same level with other search methods such as explicit state-space search with heuristic search algorithms.
2013-bottou.pdf: “Large–Scale Machine Learning Revisited [slides]”, Léon Bottou ( )
2013-grace.pdf#miri: “Algorithmic Progress in Six Domains”, (2013-12-09; ):
We examine evidence of progress in 6 areas of algorithms research [SAT, chess+Go, factoring, physics simulations, linear programming+scheduling, machine learning], with an eye to understanding likely algorithmic trajectories after the advent of artificial general intelligence. Many of these areas appear to experience fast improvement, though the data are often noisy. For tasks in these areas, gains from algorithmic progress have been roughly 50 to 100% as large as those from hardware progress. Improvements tend to be incremental, forming a relatively smooth curve on the scale of years.
2013-yudkowsky.pdf#miri: “Intelligence Explosion Microeconomics”, (2013-09-13; ):
I. J. Good’s thesis of the “intelligence explosion” states that a sufficiently advanced machine intelligence could build a smarter version of itself, which could in turn build an even smarter version, and that this process could continue to the point of vastly exceeding human intelligence. As Sandberg (2010) correctly notes, there have been several attempts to lay down return on investment formulas intended to represent sharp speedups in economic or technological growth, but very little attempt has been made to deal formally with Good’s intelligence explosion thesis as such.
I identify the key issue as returns on cognitive reinvestment—the ability to invest more computing power, faster computers, or improved cognitive algorithms to yield cognitive labor which produces larger brains, faster brains, or better mind designs. There are many phenomena in the world which have been argued to be evidentially relevant to this question, from the observed course of hominid evolution, to Moore’s Law, to the competence over time of machine chess-playing systems, and many more. I go into some depth on some debates which then arise on how to interpret such evidence. I propose that the next step in analyzing positions on the intelligence explosion would be to formalize return on investment curves, so that each stance can formally state which possible micro-foundations they hold to be falsified by historical observations. More generally I pose multiple open questions of “returns on cognitive reinvestment” or “intelligence explosion microeconomics.” Although such questions have received little attention thus far, they seem highly relevant to policy choices affecting outcomes for Earth-originating intelligent life.
2014-cambria.pdf: “Jumping NLP Curves: A Review of Natural Language Processing Research [Review Article]”, (2014-04-10; ):
Natural language processing (NLP) is a theory-motivated range of computational techniques for the automatic analysis and representation of human language. NLP research has evolved from the era of punch cards and batch processing (in which the analysis of a sentence could take up to 7 minutes) to the era of Google and the likes of it (in which millions of webpages can be processed in less than a second). This review paper draws on recent developments in NLP research to look at the past, present, and future of NLP technology in a new light. Borrowing the paradigm of ‘jumping curves’ from the field of business management and marketing prediction, this survey article reinterprets the evolution of NLP research as the intersection of three overlapping curves—namely Syntactics, Semantics, and Pragmatics Curves—which will eventually lead NLP research to evolve into natural language understanding.
2015-bluche.pdf: “Deep Neural Networks for Large Vocabulary Handwritten Text Recognition”, (2015-05-13; ):
The automatic transcription of text in handwritten documents has many applications, from automatic document processing, to indexing and document understanding.
One of the most popular approaches nowadays consists in scanning the text line image with a sliding window, from which features are extracted, and modeled by Hidden Markov Models (HMMs). Associated with neural networks, such as Multi-Layer Perceptrons (MLPs) or Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs), and with a language model, these models yield good transcriptions. On the other hand, in many machine learning applications, including speech recognition and computer vision, deep neural networks consisting of several hidden layers recently produced a large reduction of error rates.
In this thesis, we have conducted a thorough study of different aspects of optical models based on deep neural networks in the hybrid neural network/HMM scheme, in order to better understand and evaluate their relative importance.
- First, we show that deep neural networks produce consistent and large improvements over networks with one or 2 hidden layers, independently of the kind of neural network, MLP or RNN, and of input, handcrafted features or pixels.
- Then, we show that deep neural networks with pixel inputs compete with those using handcrafted features, and that depth plays an important role in the reduction of the performance gap between the 2 kinds of inputs, supporting the idea that deep neural networks effectively build hierarchical and relevant representations of their inputs, and that features are automatically learnt on the way.
- Despite the dominance of LSTM-RNNs in the recent literature of handwriting recognition, we show that deep MLPs achieve comparable results. Moreover, we evaluated different training criteria. With sequence-discriminative training, we report similar improvements for MLP/HMMs as those observed in speech recognition.
- We also show how the Connectionist Temporal Classification framework is especially suited to RNNs.
- Finally, the novel dropout technique to regularize neural networks was recently applied to LSTM-RNNs. We tested its effect at different positions in LSTM-RNNs, thus extending previous works, and we show that its relative position to the recurrent connections is important.
We conducted the experiments on 3 public databases, representing 2 languages (English and French) and 2 epochs, using different kinds of neural network inputs: handcrafted features and pixels. We validated our approach by taking part in the HTRtS contest in 2014.
The results of the final systems presented in this thesis, namely MLPs and RNNs, with handcrafted feature or pixel inputs, are comparable to the state-of-the-art on Rimes and IAM. Moreover, the combination of these systems outperformed all published results on the considered databases. [Keywords: pattern recognition, Hidden Markov Models, neural networks, hand-writing recognition]
2015-zhu.pdf: “Machine Teaching: an Inverse Problem to Machine Learning and an Approach Toward Optimal Education”, (2015-01-01; ):
I draw the reader’s attention to machine teaching, the problem of finding an optimal training set given a machine learning algorithm and a target model. In addition to generating fascinating mathematical questions for computer scientists to ponder, machine teaching holds the promise of enhancing education and personnel training. The Socratic dialogue style aims to stimulate critical thinking.
2016-bayern.pdf: “The Implications of Modern Business-Entity Law for the Regulation of Autonomous Systems”, (2016-06; ):
Nonhuman autonomous systems are not legal persons under current law. The history of organizational law, however, demonstrates that agreements can, with increasing degrees of autonomy, direct the actions of legal persons. Agreements are isomorphic with algorithms; that is, a legally enforceable agreement can give legal effect to the arbitrary discernible states of an algorithm or other process. As a result, autonomous systems may end up being able, at least, to emulate many of the private-law rights of legal persons. This essay demonstrates a technique by which this is possible by means of limited liability companies (LLCs), a very flexible modern type of business organization. The techniques that this essay describes are not just futuristic possibilities; as this essay argues, they are already possible under current law.
2016-covington.pdf#google: “Deep Neural Networks for YouTube Recommendations”, (2016-09-15; ):
YouTube represents one of the largest scale and most sophisticated industrial recommendation systems in existence. In this paper, we describe the system at a high level and focus on the dramatic performance improvements brought by deep learning. The paper is split according to the classic two-stage information retrieval dichotomy: first, we detail a deep candidate generation model and then describe a separate deep ranking model. We also provide practical lessons and insights derived from designing, iterating and maintaining a massive recommendation system with enormous user-facing impact. [Keywords: recommender system; deep learning; scalability]
2016-goh-opennsfw.html: “Image Synthesis from Yahoo's open_nsfw”, (2016; ):
Yahoo’s recently open sourced neural network, open_nsfw, is a fine tuned Residual Network which scores images on a scale of 0 to 1 on its suitability for use in the workplace…What makes an image NSFW, according to Yahoo? I explore this question with a clever new visualization technique by Nguyen et al…Like Google’s Deep Dream, this visualization trick works by maximally activating certain neurons of the classifier. Unlike deep dream, we optimize these activations by performing descent on a parameterization of the manifold of natural images.
[Demonstration of an unusual use of backpropagation to ‘optimize’ a neural network: instead of taking a piece of data to input to a neural network and then updating the neural network to change its output slightly towards some desired output (such as a correct classification), one can instead update the input so as to make the neural net output slightly more towards the desired output. When using a image classification neural network, this reversed form of optimization will ‘hallucinate’ or ‘edit’ the ‘input’ to make it more like a particular class of images. In this case, a porn/NSFW-detecting NN is reversed so as to make images more (or less) “porn-like”. Goh runs this process on various images like landscapes, musical bands, or empty images; the maximally/minimally porn-like images are disturbing, hilarious, and undeniably pornographic in some sense.]
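The reversed optimization described in the note can be sketched with a toy model. This is a minimal illustration assuming a fixed logistic scorer over a 3-element input in place of the real classifier, and it ascends the raw input directly instead of the natural-image parameterization Goh uses; all names and numbers are hypothetical.

```python
import math


def score(x, w, b):
    """Toy classifier: sigmoid(w . x + b), a score between 0 and 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def ascend_input(x, w, b, lr=0.5, steps=50):
    """Hold the weights fixed and nudge the input to raise the score."""
    x = list(x)
    for _ in range(steps):
        s = score(x, w, b)
        # Gradient of sigmoid(w . x + b) with respect to the input x.
        grad = [s * (1.0 - s) * wi for wi in w]
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
    return x


weights, bias = [1.0, -2.0, 0.5], 0.0  # frozen toy 'network'
start = [0.0, 0.0, 0.0]                # a blank input
edited = ascend_input(start, weights, bias)
```

Flipping the sign of the update would push the input toward the minimally scoring direction instead.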
2016-hernandezorallo.pdf: “Is Spearman's law of diminishing returns (SLODR) meaningful for artificial agents?”, Hernandez-Orallo
2017-esteva.pdf: “Dermatologist-level classification of skin cancer with deep neural networks”, Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, Sebastian Thrun ( )
2017-shen.pdf: “Estimation of Gap Between Current Language Models and Human Performance”, (2017-01-01; ):
Language models (LMs) have gained dramatic improvement in the past years due to the wide application of neural networks. This raises the question of how far we are away from the perfect language model and how much more research is needed in language modelling. As for perplexity, giving a value for human perplexity (as an upper bound of what is reasonably expected from an LM) is difficult. Word error rate (WER) has the disadvantage that it also measures the quality of other components of a speech recognizer like the acoustic model and the feature extraction. We therefore suggest evaluating LMs in a generative setting (which has been done before on selected hand-picked examples) and running a human evaluation on the generated sentences. The results imply that LMs need about 10 to 20 more years of research before human performance is reached. Moreover, we show that the human judgement scores on the generated sentences and perplexity are closely correlated. This leads to an estimated perplexity of 12 for an LM that would be able to pass the human judgement test in the setting we suggested. [Keywords: language model, generative task, human judgement score, performance gap]
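For context on the perplexity figure, here is a minimal sketch of the usual definition: the exponential of the average negative log-probability a model assigns to each token. The function name is hypothetical, and the abstract does not say which tokenization or test set its estimate of 12 refers to.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model gave to each observed token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


# A model that spreads its belief evenly over 12 candidates at every step
# has a perplexity of 12, up to floating-point rounding.
uniform_over_12 = perplexity([1 / 12] * 5)
```

Read this way, a perplexity of 12 means the model is, on average, as uncertain as if it were choosing among 12 equally likely next tokens.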
2018-defauw.pdf: “Clinically applicable deep learning for diagnosis and referral in retinal disease”, Jeffrey Fauw, Joseph R. Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, Sam Blackwell, Harry Askham, Xavier Glorot, Brendan O’Donoghue, Daniel Visentin, George Driessche, Balaji Lakshminarayanan, Clemens Meyer, Faith Mackinder, Simon Bouton, Kareem Ayoub, Reena Chopra, Dominic King, Alan Karthikesalingam, Cían O. Hughes, Rosalind Raine, Julian Hughes, Dawn A. Sim, Catherine Egan, Adnan Tufail, Hugh Montgomery, Demis Hassabis, Geraint Rees, Trevor Back, Peng T. Khaw, Mustafa Suleyman, Julien Cornebise, Pearse A. Keane, Olaf Ronneberger
2018-haenssle.pdf: “Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists”, Holger A. Haenssle, Christine Fink, Roland Schneiderbauer, Ferdinand Toberer, Timo Buhl, Alan Blum, Aadi Kalloo, Abdulkadir Hassen, Litha M. Thomas, Alexander H. Enk, Lorenz Uhlmann ( )
2018-oakdenrayner.pdf: “Reply to 'Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists' by H. A. Haenssle et al”, Luke Oakden-Rayner
2018-poplin.pdf: “Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning”, (2018-01-01; ):
Traditionally, medical discoveries are made by observing associations, making hypotheses from them and then designing and running experiments to test the hypotheses. However, with medical images, observing and quantifying associations can often be difficult because of the wide variety of features, patterns, colours, values and shapes that are present in real data. Here, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on 2 independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction. [Sex detection replicated in Korot et al 2021.]
2018-sharma.pdf#google: “Conceptual Captions: A Cleaned, Hypernymed, Image Alt–text Dataset For Automatic Image Captioning”, (2018-07-01; ):
We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images [3.3m training] than the MS-COCO dataset (Lin et al 2014) and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages.
We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNetv2 (Szegedy et al 2016) for image-feature extraction and Transformer (Vaswani et al 2017) for sequence modeling achieves the best performance when trained on the Conceptual Captions dataset.
2019-02-18-lecun-isscc-talk-deeplearninghardwarepastpresentandfuture.pdf: “Deep Learning Hardware: Past, Present, & Future”, Yann LeCun ( )
2019-abdal.pdf: “Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?”, (2019-04-05; ):
We propose an efficient algorithm to embed a given image into the latent space of StyleGAN. This embedding enables semantic image editing operations that can be applied to existing photographs. Taking the StyleGAN trained on the FFHQ dataset as an example, we show results for image morphing, style transfer, and expression transfer. Studying the results of the embedding algorithm provides valuable insights into the structure of the StyleGAN latent space. We propose a set of experiments to test what class of images can be embedded, how they are embedded, what latent space is suitable for embedding, and if the embedding is semantically meaningful.
…Going beyond faces, interestingly, we find that although the FFHQ StyleGAN generator is trained on a human face dataset, the embedding algorithm is capable of going far beyond human faces. As Figure 1 shows, although slightly worse than those of human faces, we can obtain reasonable and relatively high-quality embeddings of cats, dogs and even paintings and cars. This reveals the effective embedding capability of the algorithm and the generality of the learned filters of the generator.
2019-anumanchipalli.pdf: “Speech synthesis from neural decoding of spoken sentences”, (2019-04-24; ):
Technology that translates neural activity into speech would be transformative for people who are unable to communicate as a result of neurological impairments. Decoding speech from neural activity is challenging because speaking requires very precise and rapid multi-dimensional control of vocal tract articulators.
Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement, and then transformed these representations into speech acoustics. In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance even with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferable across participants. Furthermore, the decoder could synthesize speech when a participant silently mimed sentences.
These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.
2019-bashivan.pdf: “Neural population control via deep image synthesis”, (2019):
Particular deep artificial neural networks (ANNs) are today’s most accurate models of the primate brain’s ventral visual stream. Using an ANN-driven image synthesis method, we found that luminous power patterns (i.e., images) can be applied to primate retinae to predictably push the spiking activity of targeted V4 neural sites beyond naturally occurring levels. This method, although not yet perfect, achieves unprecedented independent control of the activity state of entire populations of V4 neural sites, even those with overlapping receptive fields. These results show how the knowledge embedded in today’s ANN models might be used to noninvasively set desired internal brain states at neuron-level resolution, and suggest that more accurate ANN models would produce even more accurate control.
2019-brynjolfsson.pdf: “Does Machine Translation Affect International Trade? Evidence from a Large Digital Platform”, (2019-09-03; ):
Artificial intelligence (AI) is surpassing human performance in a growing number of domains. However, there is limited evidence of its economic effects. Using data from a digital platform, we study a key application of AI: machine translation. We find that the introduction of a new machine translation system has substantially increased international trade on this platform, increasing exports by 10.9%. Furthermore, heterogeneous treatment effects are consistent with a substantial reduction in translation costs. Our results provide causal evidence that language barriers substantially hinder trade and that AI has already begun to improve economic efficiency in at least one domain.
2019-dai.pdf: “SAN: Second-Order Attention Network for Single Image Super-Resolution”, (2019-06-15; ):
Recently, deep convolutional neural networks (CNNs) have been widely explored in single image super-resolution (SISR) and obtained remarkable performance. However, most of the existing CNN-based SISR methods mainly focus on wider or deeper architecture design, neglecting to explore the feature correlations of intermediate layers, hence hindering the representational power of CNNs. To address this issue, in this paper, we propose a second-order attention network (SAN) for more powerful feature expression and feature correlation learning. Specifically, a novel trainable second-order channel attention (SOCA) module is developed to adaptively rescale the channel-wise features by using second-order feature statistics for more discriminative representations. Furthermore, we present a non-locally enhanced residual group (NLRG) structure, which not only incorporates non-local operations to capture long-distance spatial contextual information, but also contains repeated local-source residual attention groups (LSRAG) to learn increasingly abstract feature representations. Experimental results demonstrate the superiority of our SAN network over state-of-the-art SISR methods in terms of both quantitative metrics and visual quality.
2019-gervais.pdf: “The Machine As Author”, (2019-03-24; ):
The use of Artificial Intelligence (AI) machines using deep learning neural networks to create material that facially looks like it should be protected by copyright is growing exponentially. From articles in national news media to music, film, poetry and painting, AI machines create material that has economic value and that competes with productions of human authors. The Article reviews both normative and doctrinal arguments for and against the protection by copyright of literary and artistic productions made by AI machines. The Article finds that the arguments in favor of protection are flawed and unconvincing and that a proper analysis of the history, purpose, and major doctrines of copyright law all lead to the conclusion that productions that do not result from human creative choices belong to the public domain. The Article proposes a test to determine which productions should be protected, including in case of collaboration between human and machine. Finally, the Article applies the proposed test to three specific fact patterns to illustrate its application. [Keywords: copyright, author, artificial intelligence, machine learning]
2019-kleinfeld.pdf: “Can One Concurrently Record Electrical Spikes from Every Neuron in a Mammalian Brain?”, David Kleinfeld, Lan Luan, Partha P. Mitra, Jacob T. Robinson, Rahul Sarpeshkar, Kenneth Shepard, Chong Xie, Timothy D. Harris
2019-liang.pdf: “Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence”, Huiying Liang, Brian Y. Tsui, Hao Ni, Carolina C. S. Valentim, Sally L. Baxter, Guangjian Liu, Wenjia Cai, Daniel S. Kermany, Xin Sun, Jiancong Chen, Liya He, Jie Zhu, Pin Tian, Hua Shao, Lianghong Zheng, Rui Hou, Sierra Hewett, Gen Li, Ping Liang, Xuan Zang, Zhiqi Zhang, Liyan Pan, Huimin Cai, Rujuan Ling, Shuhua Li, Yongwang Cui, Shusheng Tang, Hong Ye, Xiaoyan Huang, Waner He, Wenqing Liang, Qing Zhang, Jianmin Jiang, Wei Yu, Jianqun Gao, Wanxing Ou, Yingmin Deng, Qiaozhen Hou, Bei Wang, Cuichan Yao, Yan Liang, Shu Zhang, Yaou Duan, Runze Zhang, Sarah Gibson, Charlotte L. Zhang, Oulan Li, Edward D. Zhang, Gabriel Karin, Nathan Nguyen, Xiaokang Wu, Cindy Wen, Jie Xu, Wenqin Xu, Bochu Wang, Winston Wang, Jing Li, Bianca Pizzato, Caroline Bao, Daoman Xiang, Wanting He, Suiqin He, Yugui Zhou, Weldon Haw, Michael Goldbaum, Adriana Tremoulet, Chun-Nan Hsu, Hannah Carter, Long Zhu, Kang Zhang, Huimin Xia
2019-radford.pdf#openai: “Language Models are Unsupervised Multitask Learners”, (2019-02-14; ):
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets.
We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset—matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples.
The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text.
These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
2019-richards.pdf: “A deep learning framework for neuroscience”, Blake A. Richards, Timothy P. Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy Berker, Surya Ganguli, Colleen J. Gillon, Danijar Hafner, Adam Kepecs, Nikolaus Kriegeskorte, Peter Latham, Grace W. Lindsay, Kenneth D. Miller, Richard Naud, Christopher C. Pack, Panayiota Poirazi, Pieter Roelfsema, João Sacramento, Andrew Saxe, Benjamin Scellier, Anna C. Schapiro, Walter Senn, Greg Wayne, Daniel Yamins, Friedemann Zenke, Joel Zylberberg, Denis Therien, Konrad P. Kording
2019-sinz.pdf: “Engineering a Less Artificial Intelligence”, Fabian H. Sinz, Xaq Pitkow, Jacob Reimer, Matthias Bethge, Andreas S. Tolias
2019-stiefel.pdf: “Why is There No Successful Whole Brain Simulation (Yet)?”, Klaus M. Stiefel
2019-topol.pdf: “High-performance medicine: the convergence of human and artificial intelligence”, Eric J. Topol
2019-winkler.pdf: “Association Between Surgical Skin Markings in Dermoscopic Images and Diagnostic Performance of a Deep Learning Convolutional Neural Network for Melanoma Recognition”, Julia K. Winkler, Christine Fink, Ferdinand Toberer, Alexander Enk, Teresa Deinlein, Rainer Hofmann-Wellenhof, Luc Thomas, Aimilios Lallas, Andreas Blum, Wilhelm Stolz, Holger A. Haenssle
2020-avram.pdf: “A digital biomarker of diabetes from smartphone-based vascular signals”, (2020-08-17; ):
The global burden of diabetes is rapidly increasing, from 451 million people in 2019 to 693 million by 2045. The insidious onset of type 2 diabetes delays diagnosis and increases morbidity. Given the multifactorial vascular effects of diabetes, we hypothesized that smartphone-based photoplethysmography could provide a widely accessible digital biomarker for diabetes. Here we developed a deep neural network (DNN) to detect prevalent diabetes using smartphone-based photoplethysmography from an initial cohort of 53,870 individuals (the ‘primary cohort’), which we then validated in a separate cohort of 7,806 individuals (the ‘contemporary cohort’) and a cohort of 181 prospectively enrolled individuals from three clinics (the ‘clinic cohort’). The DNN achieved an area under the curve for prevalent diabetes of 0.766 in the primary cohort (95% confidence interval: 0.750–0.782; sensitivity 75%, specificity 65%) and 0.740 in the contemporary cohort (95% confidence interval: 0.723–0.758; sensitivity 81%, specificity 54%). When the output of the DNN, called the DNN score, was included in a regression analysis alongside age, gender, race/ethnicity and body mass index, the area under the curve was 0.830 and the DNN score remained independently predictive of diabetes. The performance of the DNN in the clinic cohort was similar to that in other validation datasets. There was a statistically-significant and positive association between the continuous DNN score and hemoglobin A1c (p ≤ 0.001) among those with hemoglobin A1c data. These findings demonstrate that smartphone-based photoplethysmography provides a readily attainable, non-invasive digital biomarker of prevalent diabetes.
2020-bao.pdf: “A map of object space in primate inferotemporal cortex”, (2020-06-03):
The inferotemporal (IT) cortex is responsible for object recognition, but it is unclear how the representation of visual objects is organized in this part of the brain. Areas that are selective for categories such as faces, bodies, and scenes have been found, but large parts of IT cortex lack any known specialization, raising the question of what general principle governs IT organization. Here we used functional MRI, microstimulation, electrophysiology, and deep networks to investigate the organization of the macaque IT cortex. We built a low-dimensional object space to describe general objects using a feedforward deep neural network trained on object classification. Responses of IT cells to a large set of objects revealed that single IT cells project incoming objects onto specific axes of this space. Anatomically, cells were clustered into four networks according to the first two components of their preferred axes, forming a map of object space. This map was repeated across three hierarchical stages of increasing view invariance, and cells that comprised these maps collectively harboured sufficient coding capacity to approximately reconstruct objects. These results provide a unified picture of IT organization in which category-selective regions are part of a coarse map of object space whose dimensions can be extracted from a deep network.
2020-barshai.pdf: “Identifying Regulatory Elements via Deep Learning”, (2020-07-01):
Deep neural networks have been revolutionizing the field of machine learning for the past several years. They have been applied with great success in many domains of the biomedical data sciences and are outperforming extant methods by a large margin. The ability of deep neural networks to pick up local image features and model the interactions between them makes them highly applicable to regulatory genomics. Instead of an image, the networks analyze DNA and RNA sequences and additional epigenomic data. In this review, we survey the successes of deep learning in the field of regulatory genomics. We first describe the fundamental building blocks of deep neural networks, popular architectures used in regulatory genomics, and their training process on molecular sequence data. We then review several key methods in different gene regulation domains. We start with the pioneering method DeepBind and its successors, which were developed to predict protein–DNA binding. We then review methods developed to predict and model epigenetic information, such as histone marks and nucleosome occupancy. Following epigenomics, we review methods to predict protein–RNA binding with its unique challenge of incorporating RNA structure information. Finally, we provide our overall view of the strengths and weaknesses of deep neural networks and prospects for future developments.
2020-bell.pdf#facebook: “GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce”, (2020-08-22; ):
In this paper, we present GrokNet, a deployed image recognition system for commerce applications. GrokNet leverages a multi-task learning approach to train a single computer vision trunk. We achieve a 2.1× improvement in exact product match accuracy when compared to the previous state-of-the-art Facebook product recognition system. We achieve this by training on 7 datasets across several commerce verticals, using 80 categorical loss functions and 3 embedding losses. We share our experience of combining diverse sources with wide-ranging label semantics and image statistics, including learning from human annotations, user-generated tags, and noisy search engine interaction data. GrokNet has demonstrated gains in production applications and operates at Facebook scale.
2020-chen.pdf#openai: “iGPT: Generative Pretraining from Pixels”, (2020-06-17; ):
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
[See also Image Transformer.]
2020-cowen.pdf: “Sixteen facial expressions occur in similar contexts worldwide”, (2020-12-16):
Understanding the degree to which human facial expressions co-vary with specific social contexts across cultures is central to the theory that emotions enable adaptive responses to important challenges and opportunities. Concrete evidence linking social context to specific facial expressions is sparse and is largely based on survey-based approaches, which are often constrained by language and small sample sizes. Here, by applying machine-learning methods to real-world, dynamic behaviour, we ascertain whether naturalistic social contexts (for example, weddings or sporting competitions) are associated with specific facial expressions across different cultures. In two experiments using deep neural networks, we examined the extent to which 16 types of facial expression occurred systematically in thousands of contexts in 6 million videos from 144 countries. We found that each kind of facial expression had distinct associations with a set of contexts that were 70% preserved across 12 world regions. Consistent with these associations, regions varied in how frequently different facial expressions were produced as a function of which contexts were most salient. Our results reveal fine-grained patterns in human facial expressions that are preserved across the modern world.
2020-hasson.pdf: “Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks”, (2020-02-05; ):
Evolution is a blind fitting process by which organisms become adapted to their environment. Does the brain use similar brute-force fitting processes to learn how to perceive and act upon the world? Recent advances in artificial neural networks have exposed the power of optimizing millions of synaptic weights over millions of observations to operate robustly in real-world contexts. These models do not learn simple, human-interpretable rules or representations of the world; rather, they use local computations to interpolate over task-relevant manifolds in a high-dimensional parameter space. Counterintuitively, similar to evolutionary processes, over-parameterized models can be simple and parsimonious, as they provide a versatile, robust solution for learning a diverse set of functions. This new family of direct-fit models present a radical challenge to many of the theoretical assumptions in psychology and neuroscience. At the same time, this shift in perspective establishes unexpected links with developmental and ecological psychology. [Keywords: evolution, experimental design, interpolation, learning, neural networks]
2020-hernandezorallo.pdf: “Twenty Years Beyond the Turing Test: Moving Beyond the Human Judges Too”, (2020-11-04):
In the last 20 years the Turing test has been left further behind by new developments in artificial intelligence. At the same time, however, these developments have revived some key elements of the Turing test: imitation and adversarialness. On the one hand, many generative models, such as generative adversarial networks (GAN), build imitators under an adversarial setting that strongly resembles the Turing test (with the judge being a learnt discriminative model). The term “Turing learning” has been used for this kind of setting. On the other hand, AI benchmarks are suffering an adversarial situation too, with a ‘challenge-solve-and-replace’ evaluation dynamics whenever human performance is ‘imitated’. The particular AI community rushes to replace the old benchmark by a more challenging benchmark, one for which human performance would still be beyond AI. These two phenomena related to the Turing test are sufficiently distinctive, important and general for a detailed analysis. This is the main goal of this paper. After recognising the abyss that appears beyond superhuman performance, we build on Turing learning to identify two different evaluation schemas: Turing testing and adversarial testing. We revisit some of the key questions surrounding the Turing test, such as ‘understanding’, commonsense reasoning and extracting meaning from the world, and explore how the new testing paradigms should work to unmask the limitations of current and future AI. Finally, we discuss how behavioural similarity metrics could be used to create taxonomies for artificial and natural intelligence. Both testing schemas should complete a transition in which humans should give way to machines—not only as references to be imitated but also as judges—when pursuing and measuring machine intelligence.
2020-jouppi.pdf#google: “A domain-specific supercomputer for training deep neural networks”, (2020-06-01; ):
Google’s TPU supercomputers train deep neural networks 50× faster than general-purpose supercomputers running a high-performance computing benchmark. [See also “The Design Process for Google’s Training Chips: TPUv2 and TPUv3”, Norrie et al 2021]
2020-kreps.pdf: “All the News That’s Fit to Fabricate: AI-Generated Text as a Tool of Media Misinformation”, (2020-11-20):
Online misinformation has become a constant; only the way actors create and distribute that information is changing. Advances in artificial intelligence (AI) such as GPT-2 mean that actors can now synthetically generate text in ways that mimic the style and substance of human-created news stories. We carried out three original experiments to study whether these AI-generated texts are credible and can influence opinions on foreign policy. The first evaluated human perceptions of AI-generated text relative to an original story. The second investigated the interaction between partisanship and AI-generated news. The third examined the distributions of perceived credibility across different AI model sizes. We find that individuals are largely incapable of distinguishing between AI-generated and human-generated text; partisanship affects the perceived credibility of the story; and exposure to the text does little to change individuals’ policy views. The findings have important implications in understanding AI in online misinformation campaigns. [Keywords: misinformation, disinformation, foreign policy, public opinion, media]
2020-launay-2.pdf: “Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures”, (2020-10-22; ):
Despite being the workhorse of deep learning, the backpropagation algorithm is no panacea. It enforces sequential layer updates, thus preventing efficient parallelization of the training process. Furthermore, its biological plausibility is being challenged. Alternative schemes have been devised; yet, under the constraint of synaptic asymmetry, none have scaled to modern deep learning tasks and architectures. Here, we challenge this perspective, and study the applicability of Direct Feedback Alignment (DFA) to neural view synthesis, recommender systems, geometric learning, and natural language processing. In contrast with previous studies limited to computer vision tasks, our findings show that it successfully trains a large range of state-of-the-art deep learning architectures, with performance close to fine-tuned backpropagation. When a larger gap between DFA and backpropagation exists, like in Transformers, we attribute this to a need to rethink common practices for large and complex architectures. At variance with common beliefs, our work supports that challenging tasks can be tackled in the absence of weight transport.
2020-leipheimer.pdf: “First-in-human evaluation of a hand-held automated venipuncture device for rapid venous blood draws”, (2020-01-22):
Obtaining venous access for blood sampling or intravenous (IV) fluid delivery is an essential first step in patient care. However, success rates rely heavily on clinician experience and patient physiology. Difficulties in obtaining venous access result in missed sticks and injury to patients, and typically require alternative access pathways and additional personnel that lengthen procedure times, thereby creating unnecessary costs to healthcare facilities.
Here, we present the first-in-human assessment of an automated robotic venipuncture device designed to safely perform blood draws on peripheral forearm veins. The device combines ultrasound imaging and miniaturized robotics to identify suitable vessels for cannulation and robotically guide an attached needle toward the lumen center. The device demonstrated results comparable to or exceeding that of clinical standards, with a success rate of 87% on all participants (n = 31), a 97% success rate on non-difficult venous access participants (n = 25), and an average procedure time of 93 ± 30 s (n = 31).
In the future, this device can be extended to other areas of vascular access such as IV catheterization, central venous access, dialysis, and arterial line placement. [Keywords: medical device, robotics, image-guidance, ultrasound, vascular access, computer vision, machine learning]
2020-lillicrap.pdf: “Backpropagation and the brain”, (2020-04-17):
During learning, the brain modifies synapses to improve behaviour. In the cortex, synapses are embedded within multilayered networks, making it difficult to determine the effect of an individual synaptic modification on the behaviour of the system. The backpropagation algorithm solves this problem in deep artificial neural networks, but historically it has been viewed as biologically problematic. Nonetheless, recent developments in neuroscience and the successes of artificial neural networks have reinvigorated interest in whether backpropagation offers insights for understanding learning in the cortex. The backpropagation algorithm learns quickly by computing synaptic updates using feedback connections to deliver error signals. Although feedback connections are ubiquitous in the cortex, it is difficult to see how they could deliver the error signals required by strict formulations of backpropagation. Here we build on past and recent developments to argue that feedback connections may instead induce neural activities whose differences can be used to locally approximate these signals and hence drive effective learning in deep networks in the brain.
2020-makin.pdf: “Machine translation of cortical activity to text with an encoder–decoder framework”, Joseph G. Makin, David A. Moses, Edward F. Chang
2020-mennel.pdf: “Ultrafast machine vision with 2D material neural network image sensors”, Lukas Mennel, Joanna Symonowicz, Stefan Wachter, Dmitry K. Polyushkin, Aday J. Molina-Mendoza, Thomas Mueller
2020-su.pdf: “Avatar Artist Using GAN [CS230]”, (2020-04-12; ):
Human sketches can be expressive and abstract at the same time. Generating anime avatars from simple or even bad face drawing is an interesting area. Lots of related work has been done such as auto-coloring sketches to anime or transforming real photos to anime. However, there aren’t many interesting works yet to show how to generate anime avatars from just some simple drawing input. In this project, we propose using GAN to generate anime avatars from sketches.
2020-thompson.pdf: “Cultural influences on word meanings revealed through large-scale semantic alignment”, (2020-08-10):
If the structure of language vocabularies mirrors the structure of natural divisions that are universally perceived, then the meanings of words in different languages should closely align. By contrast, if shared word meanings are a product of shared culture, history and geography, they may differ between languages in substantial but predictable ways. Here, we analysed the semantic neighbourhoods of 1,010 meanings in 41 languages. The most-aligned words were from semantic domains with high internal structure (number, quantity and kinship). Words denoting natural kinds, common actions and artefacts aligned much less well. Languages that are more geographically proximate, more historically related and/or spoken by more-similar cultures had more aligned word meanings. These results provide evidence that the meanings of common words vary in ways that reflect the culture, history and geography of their users.
2020-vazquezguardado.pdf: “Recent advances in neurotechnologies with broad potential for neuroscience research”, (2020-11-16):
Interest in deciphering the fundamental mechanisms and processes of the human mind represents a central driving force in modern neuroscience research. Activities in support of this goal rely on advanced methodologies and engineering systems that are capable of interrogating and stimulating neural pathways, from single cells in small networks to interconnections that span the entire brain. Recent research establishes the foundations for a broad range of creative neurotechnologies that enable unique modes of operation in this context. This review focuses on those systems with proven utility in animal model studies and with levels of technical maturity that suggest a potential for broad deployment to the neuroscience community in the relatively near future. We include a brief summary of existing and emerging neuroscience techniques, as background for a primary focus on device technologies that address associated opportunities in electrical, optical and microfluidic neural interfaces, some with multimodal capabilities. Examples of the use of these technologies in recent neuroscience studies illustrate their practical value. The vibrancy of the engineering science associated with these platforms, the interdisciplinary nature of this field of research and its relevance to grand challenges in the treatment of neurological disorders motivate continued growth of this area of study.
2021-gangadharbatla.pdf: “The Role of AI Attribution Knowledge in the Evaluation of Artwork”, (2021-02-16):
Artwork is increasingly being created by machines through algorithms with little or no input from humans. Yet, very little is known about people’s attitudes and evaluations of artwork generated by machines. The current study investigates (a) whether individuals are able to accurately differentiate human-made artwork from AI-generated artwork and (b) the role of attribution knowledge (i.e., information about who created the content) in their evaluation and reception of artwork. Data was collected using an Amazon Turk sample from two survey experiments designed on Qualtrics. Findings suggest that individuals are unable to accurately identify AI-generated artwork and they are likely to associate representational art to humans and abstract art to machines. There is also an interaction effect between attribution knowledge and the type of artwork (representational vs. abstract) on purchase intentions and evaluations of artworks.
2021-norrie.pdf#google: “The Design Process for Google’s Training Chips: TPUv2 and TPUv3”, (2021; ):
Five years ago, few would have predicted that a software company like Google would build its own computers. Nevertheless, Google has been deploying computers for machine learning (ML) training since 2017, powering key Google services. These Tensor Processing Units (TPUs) are composed of chips, systems, and software, all co-designed in-house. In this paper, we detail the circumstances that led to this outcome, the challenges and opportunities observed, the approach taken for the chips, a quick review of performance, and finally a retrospective on the results. A companion paper describes the supercomputers built from these chips, the compiler, and a detailed performance analysis [Jou20].
2021-power.pdf: “Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets”, Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra
2021-santospata.pdf: “Epistemic Autonomy: Self-supervised Learning in the Mammalian Hippocampus”, (2021-04-24):
- Biological cognition is based on self-generated learning objectives. However, the mechanism by which this epistemic autonomy is realized by the neuronal substrate is not understood.
- Artificial neural networks based on error backpropagation lack epistemic autonomy because they are mostly trained in a supervised fashion. In this respect, they face the symbol grounding problem of artificial intelligence.
- We propose that the entorhinal-hippocampal complex, a brain structure located in the medial temporal lobe and central to memory, combines epistemic autonomy with intrinsically generated error gradients akin to error backpropagation.
- We present evidence supporting the hypothesis that the counter-current inhibitory projections of the entorhinal-hippocampal complex implement a continuous self-supervised error minimization between network input and output.
Biological cognition is based on the ability to autonomously acquire knowledge, or epistemic autonomy.
Such self-supervision is largely absent in artificial neural networks (ANN) because they depend on externally set learning criteria. Yet training ANN using error backpropagation has created the current revolution in artificial intelligence, raising the question of whether the epistemic autonomy displayed in biological cognition can be achieved with error backpropagation-based learning.
We present evidence suggesting that the entorhinal-hippocampal complex combines epistemic autonomy with error backpropagation. Specifically, we propose that the hippocampus minimizes the error between its input and output signals through a modulatory counter-current inhibitory network. We further discuss the computational emulation of this principle and analyze it in the context of autonomous cognitive systems. [Keywords: error backpropagation, self-supervised learning, hippocampus]