1962-bryson.pdf: “A Steepest-Ascent Method for Solving Optimum Programming Problems”, A. E. Bryson, W. F. Denham (1962-06-01):
A systematic and rapid steepest-ascent numerical procedure is described for solving two-point boundary-value problems in the calculus of variations for systems governed by a set of nonlinear ordinary differential equations. Numerical examples are presented for minimum time-to-climb and maximum altitude paths for a supersonic interceptor and maximum-range paths for an orbital glider. [Keywords: boundary-value problems, computer programming, differential equations, variational techniques]
…A systematic and rapid steepest-ascent numerical procedure is described for determining optimum programs for nonlinear systems with terminal constraints. The procedure uses the concept of local linearization around a nominal (non-optimum) path. The effect on the terminal conditions of a small change in the control variable program is determined by numerical integration of the adjoint differential equations for small perturbations about the nominal path. Having these adjoint (or influence) functions, it is then possible to determine the change in the control variable program that gives maximum increase in the pay-off function for a given mean-square perturbation of the control variable program while simultaneously changing the terminal quantities by desired amounts. By repeating this process in small steps, a control variable program that minimizes one quantity and yields specified values of other terminal quantities can be approached as closely as desired. Three numerical examples are presented: (a) The angle-of-attack program for a typical supersonic interceptor to climb to altitude in minimum time is determined with and without specified terminal velocity and heading. (b) The angle-of-attack program for the same interceptor to climb to maximum altitude is determined, (c) The angle-of-attack program is determined for a hypersonic orbital glider to obtain maximum surface range starting from satellite speed at 300,000 ft altitude.
1962-kelley.pdf: “Method of Gradients”, Henry J. Kelley (1962-01-01):
The method of gradients also known as method of steepest descent is an elementary concept for the solution of minimum problems. In recent years the computational appeal of the method has led to its adoption in a variety of application such as multivariable minimum problems of ordinary calculus, solution of systems of algebraic equations, integral equations, and variational problems. This chapter begins with a discussion of the main features of the gradient method in the context of ordinary minimum problems subject to constraints. It also discusses the variational problems of flight performance, introducing Green’s functions in the role played by partial derivatives in ordinary minimum problems and attempting to preserve an analogy between the two classes of problems in the subsequent development. The close relationship between Green’s functions or influence functions and the error coefficients of guidance theory has drawn attention to the usefulness of the adjoint system technique in guidance analysis.
1963-kelley.pdf: “Singular Extremals In Lawden's Problem Of Optimal Rocket Flight”, Henry J. Kelley (1963-07-01):
The problem of optimal rocket flight in an inverse square law force field has been studied extensively by Lawden and Leitmann. Periods of zero thrust, intermediate thrust, and maximum thrust are possible subarcs of the solution according to analysis of the Euler-Lagrange equations and the Weierstrass necessary condition. Arcs of intermediate thrust have been examined recently by Lawden; however, the question of whether or not such arcs actually may furnish a minimum has been left unresolved. The present paper derives the singular extremals of Lawden’s problem by means of the Legendre-Clebsch necessary condition applied in a transformed system of state and control variables.
These are obtained as circular orbits along which the thrust is zero and intermediate thrust arcs are found in Lawden’s analysis.Since these solutions satisfy only the weak form of the Legendre-Clebsch condition, i.e., the extremals are singular in the transformed system of variables, the question of their minimality remains unanswered.
1966-ivakhnenko.pdf: “Cybernetic Predicting Devices”, A. G. Ivakhnenko, V. G. Lapa (1966-09-23):
[Predicting programs designed for large general-purpose computers constitute an important new tool in the control of production and economics. Nevertheless, small predicting filters have their own domain of application. They can be realized not only as programs for general-purpose computers, but also as simple analog devices with very fast response. The authors discuss three principal methods of prediction in addition to some others. Prediction of deterministic processes, i.e. extrapolation and interpolation. Prediction of stochastic processes, based on statistical prediction theory. Prediction based on adaptation or learning of the predicting filters.]
1986-rumelhart-2.pdf: “Learning representations by backpropagating errors”, David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams (1986-10-09):
We describe a new learning procedure, backpropagation, for networks of neuron-like units.
The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units.
The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.
1986-rumelhart.pdf: “Learning Internal Representations by Error Propagation”, D. E. Rumelhart, G. E. Hinton, R. J. Williams (1986):
This paper presents a generalization of the perception learning procedure for learning the correct sets of connections for arbitrary networks. The rule, called the “generalized delta rule”, is a simple scheme for implementing a gradient descent method for finding weights that minimize the sum squared error of the system’s performance. The major theoretical contribution of the work is the procedure called “error propagation”, whereby the gradient can be determined by individual units of the network based only on locally available information. The major empirical contribution of the work is to show that the problem of local minima is not serious in this application of gradient descent. [Keywords: learning networks perceptrons adaptive systems learning machines, back propagation]
…In their pessimistic discussion of perceptrons, Minsky and Papert (1969) finally discuss multilayer machines near the end of their book. They state:
The perceptron has shown itself worthy of study despite (and even because of!) its severe limitations. It has many features that attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version, Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension is sterile. Perhaps some powerful convergence theorem will be discovered, or some profound reason for the failure to produce an interesting “learning theorem” for the multilayered machine will be found. (pp. 231–232)
Although our learning results do not guarantee that we can find a solution for all solvable problems, our analyses and results have shown that as a practical matter, the error propagation scheme leads to solutions in virtually every case. In short, we believe that we have answered Minsky and Papert’s challenge and have found a learning result sufficiently powerful to demonstrate that their pessimism about learning in multilayer machines was misplaced.
One way to view the procedure we have been describing is as a parallel computer that, having been shown the appropriate input/
output exemplars specifying some function, programs itself to compute that function in general. Parallel computers are notoriously difficult to program. Here we have a mechanism whereby we do not actually have to know how to write the program in order to get the system to do it. Parker (1985) has emphasized this point. [Learning-logic: casting the cortex of the human brain in silicon, (TR-47). Cambridge, MA: Massachusetts Institute of Technology, Center for Computational Research in Economics and Management Science]
On many occasions we have been surprised to learn of new methods of computing interesting functions by observing the behavior of our learning algorithm. This also raised the question of generalization. In most of the cases presented above, we have presented the system with the entire set of exemplars. It is interesting to ask what would happen if we presented only a subset of the exemplars at training time and then watched the system generalize to remaining exemplars. In small problems such as those presented here, the system sometimes finds solutions to the problems which do not properly generalize. However, preliminary results on larger problems are very encouraging in this regard. This research is still in progress and cannot be reported here. This is currently a very active interest of ours.
Finally, we should say that this work is not yet in a finished form. We have only begun our study of recurrent networks and sigma-pi units. We have not yet applied our learning procedure to many very complex problems. However, the results to date are encouraging and we are continuing our work.
1987-mcdermott.pdf: “A critique of pure reason”, Drew McDermott (1987-02-01):
[1987 retrospective by noted proponent of logic for planning and reasoning in AI (‘GOFAI’); McDermott criticizes his own work fiercely, along with that of his colleagues (particularly John McCarthy, Robert Moore, James Allen, Jerry Hobbs, & Patrick Hayes), describing the ‘logicist’ paradigm—that sufficiently ingenious and persistent application of logical reasoning, mostly first-order logic, can eventually give rise to human-level understanding of the world, planning & execution of actions, and eventually AGI.
McDermott concludes that the nature of such programs is that they are unable to see if they are making real progress (because a failure to infer something could simply reflect a lacking axiom), and worse, that such logics are not even an approximation to what intelligence is, or a role model, or that failures reflect poor choice of axioms, but that logics only verify things and do not compute useful things like plans, and collapse into verifying trivialities which do no useful intellectual work. Resorts to powerful tools like temporal logics or nonmonotonic logics sacrifice the philosophical advantages of logical inference in an attempt to get working systems, but may obtain neither. What is necessary is doing without deduction.]
It must be the case that a substantial portion of the inferences we want [to make] are deductions, or it will simply be irrelevant how many theorems follow deductively from a given axiom set.
…To summarize: The logicist project of expressing “naive physics” in first-order logic has not been very successful. One reason may be that the basic argument was flawed. You cannot write down axioms independent of a program for manipulating them if the inferences you are interested in are not deductions. Unfortunately, very few interesting inferences are deductions, and the attempts by logicists to extend logic to cover more territory have been disappointing. Hence we must resign ourselves to writing programs, and viewing knowledge representations as entities to be manipulated by the programs.
…Finally, I should admit that I am still doing work in the paradigm that I criticize here. In the domain of shape representation, so little is known that focusing on an idealization cannot but help teach us something. The problem I would like to tackle is representing the knowledge required to answer questions like, Could a paper clip be used as a key ring? The idealization I have been forced to fall back on is to prove that a paper clip of a certain size and shape could fit through the hole of a typical key. It should be obvious how much of the original problem this leaves out. Still, the territory is so unexplored that a tour through the idealized fragment could turn up something interesting. What one cannot hope for is to express as logical axioms everything there is to know about using shapes in unusual ways, before designing programs for this task. This will probably come as a shock to no one but me and a few friends.
1987-robinson.pdf: “The Utility Driven Dynamic Error Propagation Network”, A. J. Robinson, F. Fallside (1987-11-04):
Error propagation networks are able to learn a variety of tasks in which a static input pattern is mapped onto a static output pattern. This paper presents a generalisation of these nets to deal with time varying, or dynamic, patterns. Three possible architectures are explored which deal with learning sequences of known finite length and sequences of unknown and possibly infinite length. Several examples are given and an application to speech coding is discussed.
A further development of dynamic nets is made which allows them to be trained by a signal which expresses the correctness of the output of the net, the utility signal. On one possible architecture for such utility driven dynamic nets is given and a simple example is presented. Utility driven dynamic nets are potentially able to calculate and maximize any function of the input and output data streams, within the considered context. This is a very powerful property, and an appendix presents a comparison of the information processing in utility driven dynamic nets and that in the human brain.
1989-williams-2.pdf: “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks”, Ronald J. Williams, David Zipser (1989-06):
The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks. These algorithms have (1) the advantage that they do not require a precisely defined training interval, operating while the network runs; and (2) the disadvantage that they require nonlocal communication in the network being trained and are computationally expensive. These algorithms allow networks having recurrent connections to learn complex tasks that require the retention of information over time periods having either fixed or indefinite length.
1989-williams.pdf: “Experimental Analysis of the Real-time Recurrent Learning Algorithm”, Ronald J. Williams, David Zipser (1989):
The real-time recurrent learning algorithm (RTRL) is a gradient-following learning algorithm for completely recurrent networks running in continually sampled time. Here we use a series of simulation experiments to investigate the power and properties of this algorithm. In the recurrent networks studied here, any unit can be connected to any other, and any unit can receive external input. These networks run continually in the sense that they sample their inputs on every update cycle, and any unit can have a training target on any cycle. The storage required and computation time on each step are independent of time and are completely determined by the size of the network, so no prior knowledge of the temporal structure of the task being learned is required. The algorithm is nonlocal in the sense that each unit must have knowledge of the complete recurrent weight matrix and error vector. The algorithm is computationally intensive in sequential computers, requiring a storage capacity of the order of the third power of the number of units and a computation time on each cycle of the order of the fourth power of the number of units. The simulations include examples in which networks are taught tasks not possible with tapped delay lines—that is, tasks that require the preservation of state over potentially unbounded periods of time. The most complex example of this kind is learning to emulate a Turing machine that does a parenthesis balancing problem. Examples are also given of networks that do feedforward computations with unknown delays, requiring them to organize into networks with the correct number of layers. Finally, examples are given in which networks are trained to oscillate in various ways, including sinusoidal oscillation.
1990-schmidhuber.pdf: “A Local Learning Algorithm for Dynamic Feedforward and Recurrent Networks”, Jurgen Schmidhuber (1989):
Most known learning algorithms for dynamic neural networks in non-stationary environments need global computations to perform credit assignment. These algorithms either are not local in time or not local in space. Those algorithms which are local in both time and space usually cannot deal sensibly with ‘hidden units’. In contrast, as far as we can judge, learning rules in biological systems with many ‘hidden units’ are local in both space and time. In this paper we propose a parallel on-line learning algorithms which performs local computations only, yet still is designed to deal with hidden units and with units whose past activations are ‘hidden in time’. The approach is inspired by Holland’s idea of the bucket brigade for classifier systems, which is transformed to run on a neural network with fixed topology. The result is a feedforward or recurrent ‘neural’ dissipative system which is consuming ‘weight-substance’ and permanently trying to distribute this substance onto its connections in an appropriate way. Simple experiments demonstrating the feasibility of the algorithm are reported.
1990-schwartz.pdf: “Exhaustive Learning”, D. B. Schwartz, V. K. Samalam, Sara A. Solla, J. S. Denker (1990-09):
Exhaustive exploration of an ensemble of networks is used to model learning and generalization in layered neural networks. A simple Boolean learning problem involving networks with binary weights is numerically solved to obtain the entropy Sm and the average generalization ability Gm as a function of the size m of the training set. Learning curves Gm vs m are shown to depend solely on the distribution of generalization abilities over the ensemble of networks. Such distribution is determined prior to learning, and provides a novel theoretical tool for the prediction of network performance on a specific task.
1991-hochreiter.pdf: “Untersuchungen zu dynamischen neuronalen Netzen [Studies of dynamic neural networks]”, Sepp Hochreiter (1991-06-15):
[GPT-3 translation of German abstract]
Since the seminal article by Williams, Hinton, and Rumelhart [RHW86], backpropagation (BP) has become very popular as a learning method for neural networks with and without feedback. In contrast to many other learning methods for neural networks, BP takes into account the network structure and improves the network on the basis of this knowledge.
Since a very remote past input has to influence the present output, if it is randomly selected, this input is very unlikely to influence the present state of the network. Hence BP algorithms do not detect the fact that this input is responsible for the output desired. Therefore, BP algorithms are very hard to train a network to remember an input until it is needed to produce a later output. Moreover, the public BP algorithms take a very long time to compute.
In many cases, though, one needs an input sequence, as in Mozer [Moz90], which learns to compose music, where musical pieces are repeated and later note pitches are determined by previous note pitches. Steers a vehicle in a labyrinth, and the network obtains the error information only if the vehicle is in a dead end, so back-propagated errors are needed. If a neural network controls a robot that performs a task, perhaps some preparatory tasks are necessary whose performance the system should remember.
In this work, we investigate how to approach the problem of the long learning time associated with net-work inputs that are used later to control a desired output. This can be done either by means of the net-work architecture or by using the structure of the input sequences. In Chapter 4, a network is built so that inputs that are received at long delays are considered better than in the usual network architecture. Here, ‘storage nodes’ are introduced, which can carry information about an arbitrarily long time interval. The shortening of the input sequences, while retaining all relevant information, is investigated in Chapter 3. When a shortened input sequence must be recognized within this not so far back into the past, to recognize the relevant inputs. In Chapter 1 the used BP-Learn algorithms are presented, which are then in Chapter 2 analyzed to determine the cause of the long learning time to learn to store past inputs. To the algorithms it should be said that in some cases these were slightly modified to save computational time. The problem of resource acquisition occurring in Chapter 3 and 4 methods is addressed in Chapter 5.
The described experiments were performed on Sparc-based SUN stations. Due to time-resource constraints, algorithm comparison tests could not be carried out in the desired extent. There were trials that ran for up to a week on these machines, but other processes with higher priority were also running on these machines.
The definitions and notations in this work are not those commonly used in studies of neural networks, but they are introduced here only for this work. The reason is that there are no uniform, fundamental definitions for neural networks on which other authors would have based their work. Therefore, it is not guaranteed that there are no inconsistencies in the definitions and notations with other works.
These results were not all mathematically proven, as the work does not claim to be a mathematical analysis of neural networks. It is also difficult to find simple mathematical formalisms for neural networks. The work will rather describe ideas and approaches to see if it is possible to get a better grip on the problem of the long learning time for important previous inputs.
Besides the methods described here for learning in non-static environments, there is also the approach of the “Adaptive Critic”, as described in [Sch90a] and [Sch90c [“Recurrent networks adjusted by adaptive critics”]]. The approach of “fast weights” by Schmidhuber [Sch91b] founds a storage function, although with a completely different approach than in Chapter 4, where a storage is also constructed.
1991-schmidhuber.pdf: “Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks”, Jurgen Schmidhuber (1992):
Previous algorithms for supervised sequence learning are based on dynamic recurrent networks. This paper describes an alternative class of gradient-based systems consisting of two feedforward nets that learn to deal with temporal sequences using fast weights: The first net learns to produce context-dependent weight changes for the second net whose weights may vary very quickly. The method offers the potential for STM storage efficiency: A single weight (instead of a full-fledged unit) may be sufficient for storing temporal information. Various learning methods are derived. Two experiments with unknown time delays illustrate the approach. One experiment shows how the system can be used for adaptive temporary variable binding.
1995-breiman.pdf: “Reflections After Refereeing Papers for NIPS”, Leo Breiman (1995-01-01):
The theoretical work by Weiner and others on the spectral analysis of stationary time series penetrated statistics following Tukey’s heuristic work on estimation of the spectrum. In refereeing papers for NIPS the author was struck by the growing emphasis on mathematical theory. Mathematical theory is not critical to the development of machine learning. In machine learning, the current panacea is a sigmoid network fitted using backpropagation. The pi-method, for approximating functions using noisy data, was suggested by results in mathematical approximation theory. In spite of intense activity, none of the work has had any effect on the day-to-day practice of statistics, or even on present-day theory. The useful theories was not meant to be inclusive, but even a more inclusive list would be very short. A possible reason is that it is difficult to formulate reasonable analytic models for complex data.
…Uses Of Theory
- Comfort: We knew it worked, but it’s nice to have a proof.
- Insight: Aha! So that’s why it works.
- Innovation: At last, a mathematically proven idea that applies to data.
- Suggestion: Something like this might work with data.
…Our fields would be better off with far fewer theorems, less emphasis on faddish stuff, and much more scientific inquiry and engineering. But the latter requires real thinking. For instance, there are many important questions regarding neural networks which are largely unanswered. There seem to be conflicting stories regarding the following issues:
- Why don’t heavily parameterized neural networks overfit the data?
- What is the effective number of parameters?
- Why doesn’t backpropagation head for a poor local minima?
- When should one stop the backpropagation and use the current parameters?
It makes research more interesting to know that there is no one universally best method. What is best is data dependent. Sometimes “least glamorous” methods such as nearest neighbor are best. We need to learn more about what works best where. But emphasis on theory often distracts us from doing good engineering and living with the data.
1995-mozer.pdf: “A Focused Backpropagation Algorithm for Temporal Pattern Recognition”, Michael C. Mozer (1995):
Time is at the heart of many pattern recognition tasks (e.g., speech recognition). However, connectionist learning algorithms to date are not well-suited for dealing with time-varying input patterns.
This chapter introduces a specialized connectionist architecture and corresponding specialization of the backpropagation learning algorithm that operates efficiently, both in computational time and space requirements, on temporal sequences. The key feature of the architecture is a layer of self-connected hidden units that integrate their current value with the new input at each time step to construct a static representation of the temporal input sequence. This architecture avoids two deficiencies found in the backpropagation unfolding-in-time procedure (Rumelhart, Hinton, & Williams, 1986) for handing sequence recognition tasks: first, it reduces the difficulty of temporal credit assignment by focusing the backpropagated error signal; second, it eliminates the need for a buffer to hold the input sequence and/
or intermediate activity levels. The latter property is due to the fact that during the forward (activation) phase, incremental activity traces can be locally computed that hold all information necessary for backpropagation in time.
It is argued that this architecture should scale better than conventional recurrent architectures with respect to sequence length. The architecture has been used to implement a temporal version of Rumelhart and McClelland’s (1986) verb past-tense model. The hidden units learn to behave something like Rumelhart and McClelland’s “Wickelphones,” a rich and flexible representation of temporal information
1996-opper.pdf: “Statistical Mechanics of Generalization”, Manfred Opper, Wolfgang Kinzel (1996):
We estimate a neural network’s ability to generalize from examples using ideas from statistical mechanics. We discuss the connection between this approach and other powerful concepts from mathematical statistics, computer science, and information theory that are useful in explaining the performance of such machines. For the simplest network, the perceptron, we introduce a variety of learning problems that can be treated exactly by the replica method of statistical physics.
1997-hochreiter.pdf: “Long Short-Term Memory”, Sepp Hochreiter, Jürgen Schmidhuber (1997-12-15):
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter’s (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM).
Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is 𝒪(1).
Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
2001-banko.pdf#microsoft: “Scaling to Very Very Large Corpora for Natural Language Disambiguation”, Michele Banko, Eric Brill (2001-07-01):
The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
…We collected a 1-billion-word training corpus from a variety of English texts, including news articles, scientific abstracts, government transcripts, literature and other varied forms of prose. This training corpus is three orders of magnitude greater than the largest training corpus previously used for this problem. We used 1 million words of Wall Street Journal text as our test set, and no data from the Wall Street Journal was used when constructing the training corpus. Each learner was trained at several cutoff points in the training corpus, i.e. the first one million words, the first five million words, and so on, until all one billion words were used for training. In order to avoid training biases that may result from merely concatenating the different data sources to form a larger training corpus, we constructed each consecutive training corpus by probabilistically sampling sentences from the different sources weighted by the size of each source.
In Figure 1, we show learning curves for each learner, up to one billion words of training data. Each point in the graph is the average performance over ten confusion sets for that size training corpus. Note that the curves appear to be log-linear even out to one billion words.
2002-behr.pdf: “Estimating and Comparing Entropy across Written Natural Languages Using PPM Compression”, Frederic H. Behr Jr., Victoria Fossum, Michael Mitzenmacher, David Xiao (2002):
Previous work on estimating the entropy of written natural language has focused primarily on English. We expand this work by considering other natural languages, including Arabic, Chinese, French, Greek, Japanese, Korean, Russian, and Spanish. We present the results of PPM compression on machine-generated and human-generated translations of texts into various languages. Under the assumption that languages are equally expressive, and that PPM compression does well across languages, one would expect that translated documents would compress to approximately the same size. We verify this empirically on a novel corpus of translated documents. We suggest as an application of this finding using the size of compressed natural language texts as a mean of automatically testing translation quality.
2002-bixby.pdf: “Solving Real-World Linear Programs: A Decade and More of Progress”, Robert E. Bixby (2002-02-01):
This paper is an invited contribution to the 50th anniversary issue of the journal Operations Research, published by the Institute of Operations Research and Management Science (INFORMS). It describes one person’s perspective on the development of computational tools for linear programming. The paper begins with a short personal history, followed by historical remarks covering the some 40 years of linear-programming developments that predate my own involvement in this subject. It concludes with a more detailed look at the evolution of computational linear programming since 1987.
…In this paper I have focused primarily on one issue, solving larger, more difficult linear programs faster. The numbers presented speak for themselves. 3 orders of magnitude in machine speed and 3 orders of magnitude in algorithmic speed add up to six orders of magnitude in solving power: A model that might have taken a year to solve 10 years ago can now solve in less than 30 seconds. Of course, no one waits 1 year to solve a model, at least no one I know. The real meaning of such an advance is much harder to measure in practice, but it is real nevertheless. There is no doubt that we now have optimization engines at our disposal that dwarf what was available only a few years ago, making possible the solution of real-world models once considered intractable, and opening up whole new domains of application.
How do these speed improvements fit into the overall picture of linear-programming practice? They are only a part of that picture, though an essential, enabling part. The pervasive availability of powerful, usable desktop computing, the availability of data to feed our models, and the emergence of algebraic modeling languages to represent our models have all combined with the underlying engines to make operations research and linear programming the powerful tools they are today. However, there are still important issues to be solved. In spite of all the advances, the application of linear programming remains primarily the domain of experts. The need for abstraction still stands as a hurdle between technology and solutions. While the existence of this hurdle is disconcerting, it is at least gratifying to know that the benefits from overcoming it are now greater than ever.
2006-friston.pdf: “A free energy principle for the brain”, Karl Friston, James Kilner, Lee Harrison (2006-07-01):
By formulating Helmholtz’s ideas about perception, in terms of modern-day theories, one arrives at a model of perceptual inference and learning that can explain a remarkable range of neurobiological facts: using constructs from statistical physics, the problems of inferring the causes of sensory input and learning the causal structure of their generation can be resolved using exactly the same principles. Furthermore, inference and learning can proceed in a biologically plausible fashion. The ensuing scheme rests on Empirical Bayes and hierarchical models of how sensory input is caused. The use of hierarchical models enables the brain to construct prior expectations in a dynamic and context-sensitive fashion. This scheme provides a principled way to understand many aspects of cortical organisation and responses.
In this paper, we show these perceptual processes are just one aspect of emergent behaviours of systems that conform to a free energy principle. The free energy considered here measures the difference between the probability distribution of environmental quantities that act on the system and an arbitrary distribution encoded by its configuration. The system can minimise free energy by changing its configuration to affect the way it samples the environment or change the distribution it encodes. These changes correspond to action and perception respectively and lead to an adaptive exchange with the environment that is characteristic of biological systems. This treatment assumes that the system’s state and structure encode an implicit and probabilistic model of the environment. We will look at the models entailed by the brain and how minimisation of its free energy can explain its dynamics and structure. [Keywords: Variational Bayes, free energy, inference, perception, action, learning, attention, selection, hierarchical]
2007-brants.pdf#google: “Large Language Models in Machine Translation”, Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean (2007-06):
This paper reports on the benefits of large-scale statistical language modeling in machine translation. A distributed infrastructure is proposed which we use to train on up to 2 trillion tokens, resulting in language models having up to 300 billion n-grams. It is capable of providing smoothed probabilities for fast, single-pass decoding. We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large datasets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases.
2009-halevy.pdf: “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig, Fernando Pereira (2009-03-24):
At Brown University, there is excitement of having access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it’s taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It’s not annotated with carefully hand-corrected part-of-speech tags. But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions—captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks—if only we knew how to extract the model from the data.
…For many tasks, words and word combinations provide all the representational machinery we need to learn from text.
…So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do
2012-rintanen.pdf: “Planning as satisfiability: Heuristics”, Jussi Rintanen (2012-12):
Reduction to SAT is a very successful approach to solving hard combinatorial problems in Artificial Intelligence and computer science in general. Most commonly, problem instances reduced to SAT are solved with a general-purpose SAT solver. Although there is the obvious possibility of improving the SAT solving process with application-specific heuristics, this has rarely been done successfully.
In this work we propose a planning-specific variable selection strategy for SAT solving. The strategy is based on generic principles about properties of plans, and its performance with standard planning benchmarks often substantially improves on generic variable selection heuristics, such as VSIDS, and often lifts it to the same level with other search methods such as explicit state-space search with heuristic search algorithms.
2013-grace.pdf#miri: “Algorithmic Progress in Six Domains”, Katja Grace (2013-12-09):
We examine evidence of progress in 6 areas of algorithms research [SAT, chess+Go, factoring, physics simulations, linear programming+scheduling, machine learning], with an eye to understanding likely algorithmic trajectories after the advent of artificial general intelligence. Many of these areas appear to experience fast improvement, though the data are often noisy. For tasks in these areas, gains from algorithmic progress have been roughly 50 to 100% as large as those from hardware progress. Improvements tend to be incremental, forming a relatively smooth curve on the scale of years.
2013-yudkowsky.pdf#miri: “Intelligence Explosion Microeconomics”, Eliezer Yudkowsky (2013-09-13):
I. J. Good’s thesis of the “intelligence explosion” states that a sufficiently advanced machine intelligence could build a smarter version of itself, which could in turn build an even smarter version, and that this process could continue to the point of vastly exceeding human intelligence. As Sandberg (2010) correctly notes, there have been several attempts to lay down return on investment formulas intended to represent sharp speedups in economic or technological growth, but very little attempt has been made to deal formally with Good’s intelligence explosion thesis as such.
I identify the key issue as returns on cognitive reinvestment—the ability to invest more computing power, faster computers, or improved cognitive algorithms to yield cognitive labor which produces larger brains, faster brains, or better mind designs. There are many phenomena in the world which have been argued to be evidentially relevant to this question, from the observed course of hominid evolution, to Moore’s Law, to the competence over time of machine chess-playing systems, and many more. I go into some depth on some debates which then arise on how to interpret such evidence. I propose that the next step in analyzing positions on the intelligence explosion would be to formalize return on investment curves, so that each stance can formally state which possible micro-foundations they hold to be falsified by historical observations. More generally I pose multiple open questions of “returns on cognitive reinvestment” or “intelligence explosion microeconomics.” Although such questions have received little attention thus far, they seem highly relevant to policy choices affecting outcomes for Earth-originating intelligent life.
2014-cambria.pdf: “Jumping NLP Curves: A Review of Natural Language Processing Research [Review Article]”, Erik Cambria, Bebo White (2014-04-10):
Natural language processing (NLP) is a theory-motivated range of computational techniques for the automatic analysis and representation of human language. NLP research has evolved from the era of punch cards and batch processing (in which the analysis of a sentence could take up to 7 minutes) to the era of Google and the likes of it (in which millions of webpages can be processed in less than a second). This review paper draws on recent developments in NLP research to look at the past, present, and future of NLP technology in a new light. Borrowing the paradigm of ` jumping curves’ from the field of business management and marketing prediction, this survey article reinterprets the evolution of NLP research as the intersection of three overlapping curves—namely Syntactics, Semantics, and Pragmatics Curves—which will eventually lead NLP research to evolve into natural language understanding.
I draw the reader’s attention to machine teaching, the problem of finding an optimal training set given a machine learning algorithm and a target model. In addition to generating fascinating mathematical questions for computer scientists to ponder, machine teaching holds the promise of enhancing education and personnel training. The Socratic dialogue style aims to stimulate critical thinking.
Nonhuman autonomous systems are not legal persons under current law. The history of organizational law, however, demonstrates that agreements can, with increasing degrees of autonomy, direct the actions of legal persons. Agreements are isomorphic with algorithms; that is, a legally enforceable agreement can give legal effect to the arbitrary discernible states of an algorithm or other process. As a result, autonomous systems may end up being able, at least, to emulate many of the private-law rights of legal persons. This essay demonstrates a technique by which this is possible by means of limited liability companies (LLCs), a very flexible modern type of business organization. The techniques that this essay describes are not just futuristic possibilities; as this essay argues, they are already possible under current law.
2016-covington.pdf#google: “Deep Neural Networks for YouTube Recommendations”, Paul Covington, Jay Adams, Emre Sargin (2016-09-15):
YouTube represents one of the largest scale and most sophisticated industrial recommendation systems in existence. In this paper, we describe the system at a high level and focus on the dramatic performance improvements brought by deep learning. The paper is split according to the classic two-stage information retrieval dichotomy: first, we detail a deep candidate generation model and then describe a separate deep ranking model. We also provide practical lessons and insights derived from designing, iterating and maintaining a massive recommendation system with enormous user-facing impact. [Keywords: recommender system; deep learning; scalability]
2016-goh-opennsfw.html: “Image Synthesis from Yahoo's
open_nsfw”, Gabriel Goh (2016):
Yahoo’s recently open sourced neural network,
open_nsfw, is a fine tuned Residual Network which scores images on a scale of 0 to 1 on its suitability for use in the workplace…What makes an image NSFW, according to Yahoo? I explore this question with a clever new visualization technique by Nguyen et al…Like Google’s Deep Dream, this visualization trick works by maximally activating certain neurons of the classifier. Unlike deep dream, we optimize these activations by performing descent on a parameterization of the manifold of natural images.
[Demonstration of an unusual use of backpropagation to ‘optimize’ a neural network: instead of taking a piece of data to input to a neural network and then updating the neural network to change its output slightly towards some desired output (such as a correct classification), one can instead update the input so as to make the neural net output slightly more towards the desired output. When using a image classification neural network, this reversed form of optimization will ‘hallucinate’ or ‘edit’ the ‘input’ to make it more like a particular class of images. In this case, a porn/NSFW-detecting NN is reversed so as to make images more (or less) “porn-like”. Goh runs this process on various images like landscapes, musical bands, or empty images; the maximally/
minimally porn-like images are disturbing, hilarious, and undeniably pornographic in some sense.]
2017-shen.pdf: “Estimation of Gap Between Current Language Models and Human Performance”, Xiaoyu Shen, Youssef Oualil, Clayton Greenberg, Mittul Singh, Dietrich Klakow (2017-01-01):
Language models (LMs) have gained dramatic improvement in the past years due to the wide application of neural networks. This raises the question of how far we are away from the perfect language model and how much more research is needed in language modelling. As for perplexity giving a value for human perplexity (as an upper bound of what is reasonably expected from an LM) is difficult. Word error rate (WER) has the disadvantage that it also measures the quality of other components of a speech recognizer like the acoustic model and the feature extraction. We therefore suggest evaluating LMs in a generative setting (which has been done before on selected hand-picked examples) and running a human evaluation on the generated sentences. The results imply that LMs need about 10 to 20 more years of research before human performance is reached. Moreover, we show that the human judgement scores on the generated sentences and perplexity are closely correlated. This leads to an estimated perplexity of 12 for an LM that would be able to pass the human judgement test in the setting we suggested. [Keywords: language model, generative task, human judgement score, performance gap]
2018-poplin.pdf: “Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning”, Ryan Poplin, Avinash V. Varadarajan, Katy Blumer, Yun Liu, Michael V. McConnell, Greg S. Corrado, Lily Peng, Dale R. Webster (2018-01-01):
Traditionally, medical discoveries are made by observing associations, making hypotheses from them and then designing and running experiments to test the hypotheses. However, with medical images, observing and quantifying associations can often be difficult because of the wide variety of features, patterns, colours, values and shapes that are present in real data. Here, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction.
2019-abdal.pdf: “Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?”, Rameen Abdal, Yipeng Qin, Peter Wonka (2019-04-05):
We propose an efficient algorithm to embed a given image into the latent space of StyleGAN. This embedding enables semantic image editing operations that can be applied to existing photographs. Taking the StyleGAN trained on the FFHQ dataset as an example, we show results for image morphing, style transfer, and expression transfer. Studying the results of the embedding algorithm provides valuable insights into the structure of the StyleGAN latent space. We propose a set of experiments to test what class of images can be embedded, how they are embedded, what latent space is suitable for embedding, and if the embedding is semantically meaningful.
…Going beyond faces, interestingly, we find that although the FFHQ StyleGAN generator is trained on a human face dataset, the embedding algorithm is capable to go far beyond human faces. As Figure 1 shows, although slightly worse than those of human faces, we can obtain reasonable and relatively high-quality embeddings of cats, dogs and even paintings and cars. This reveals the effective embedding capability of the algorithm and the generality of the learned filters of the generator.
2019-anumanchipalli.pdf: “Speech synthesis from neural decoding of spoken sentences”, Gopala K. Anumanchipalli, Josh Chartier, Edward F. Chang (2019-04-24):
Technology that translates neural activity into speech would be transformative for people who are unable to communicate as a result of neurological impairments. Decoding speech from neural activity is challenging because speaking requires very precise and rapid multi-dimensional control of vocal tract articulators.
Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement, and then transformed these representations into speech acoustics. In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance even with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferable across participants. Furthermore, the decoder could synthesize speech when a participant silently mimed sentences.
These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.
2019-bashivan.pdf: “Neural population control via deep image synthesis”, Pouya Bashivan, Kohitij Kar, James J. DiCarlo (2019):
Particular deep artificial neural networks (ANNs) are today’s most accurate models of the primate brain’s ventral visual stream. Using an ANN-driven image synthesis method, we found that luminous power patterns (i.e., images) can be applied to primate retinae to predictably push the spiking activity of targeted V4 neural sites beyond naturally occurring levels. This method, although not yet perfect, achieves unprecedented independent control of the activity state of entire populations of V4 neural sites, even those with overlapping receptive fields. These results show how the knowledge embedded in today’s ANN models might be used to noninvasively set desired internal brain states at neuron-level resolution, and suggest that more accurate ANN models would produce even more accurate control.
2019-brynjolfsson.pdf: “Does Machine Translation Affect International Trade? Evidence from a Large Digital Platform”, Erik Brynjolfsson, Xiang Hui, Meng Liu (2019-09-03):
Artificial intelligence (AI) is surpassing human performance in a growing number of domains. However, there is limited evidence of its economic effects. Using data from a digital platform, we study a key application of AI: machine translation. We find that the introduction of a new machine translation system has substantially increased international trade on this platform, increasing exports by 10.9%. Furthermore, heterogeneous treatment effects are consistent with a substantial reduction in translation costs. Our results provide causal evidence that language barriers substantially hinder trade and that AI has already begun to improve economic efficiency in at least one domain.
2019-dai.pdf: “SAN: Second-Order Attention Network for Single Image Super-Resolution”, Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, Lei Zhang (2019-06-15):
Recently, deep convolutional neural networks (CNNs) have been widely explored in single image super-resolution (SISR) and obtained remarkable performance. However, most of the existing CNN-based SISR methods mainly focus on wider or deeper architecture design, neglecting to explore the feature correlations of intermediate layers, hence hindering the representational power of CNNs. To address this issue, in this paper, we propose a second-order attention network (SAN) for more powerful feature expression and feature correlation learning. Specifically, a novel trainable second-order channel attention (SOCA) module is developed to adaptively rescale the channel-wise features by using second-order feature statistics for more discriminative representations. Furthermore, we present a non-locally enhanced residual group (NLRG) structure, which not only incorporates non-local operations to capture long-distance spatial contextual information, but also contains repeated local-source residual attention groups (LSRAG) to learn increasingly abstract feature representations. Experimental results demonstrate the superiority of our SAN network over state-of-the-art SISR methods in terms of both quantitative metrics and visual quality.
2019-gervais.pdf: “The Machine As Author”, Daniel J. Gervais (2019-03-24):
The use of Artificial Intelligence (AI) machines using deep learning neural networks to create material that facially looks like it should be protected by copyright is growing exponentially. From articles in national news media to music, film, poetry and painting, AI machines create material that has economic value and that competes with productions of human authors. The Article reviews both normative and doctrinal arguments for and against the protection by copyright of literary and artistic productions made by AI machines. The Article finds that the arguments in favor of protection are flawed and unconvincing and that a proper analysis of the history, purpose, and major doctrines of copyright law all lead to the conclusion that productions that do not result from human creative choices belong to the public domain. The Article proposes a test to determine which productions should be protected, including in case of collaboration between human and machine. Finally, the Article applies the proposed test to three specific fact patterns to illustrate its application. [Keywords: copyright, author, artificial intelligence, machine learning]
2019-radford.pdf#openai: “Language Models are Unsupervised Multitask Learners”, Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever (2019-02-14):
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets.
We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset—matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples.
The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text.
These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
2020-avram.pdf: “A digital biomarker of diabetes from smartphone-based vascular signals”, Robert Avram, Jeffrey E. Olgin, Peter Kuhar, J. Weston Hughes, Gregory M. Marcus, Mark J. Pletcher, Kirstin Aschbacher, Geoffrey H. Tison (2020-08-17):
The global burden of diabetes is rapidly increasing, from 451 million people in 2019 to 693 million by 2045. The insidious onset of type 2 diabetes delays diagnosis and increases morbidity. Given the multifactorial vascular effects of diabetes, we hypothesized that smartphone-based photoplethysmography could provide a widely accessible digital biomarker for diabetes. Here we developed a deep neural network (DNN) to detect prevalent diabetes using smartphone-based photoplethysmography from an initial cohort of 53,870 individuals (the ‘primary cohort’), which we then validated in a separate cohort of 7,806 individuals (the ‘contemporary cohort’) and a cohort of 181 prospectively enrolled individuals from three clinics (the ‘clinic cohort’). The DNN achieved an area under the curve for prevalent diabetes of 0.766 in the primary cohort (95% confidence interval: 0.750–0.782; sensitivity 75%, specificity 65%) and 0.740 in the contemporary cohort (95% confidence interval: 0.723–0.758; sensitivity 81%, specificity 54%). When the output of the DNN, called the DNN score, was included in a regression analysis alongside age, gender, race/
ethnicity and body mass index, the area under the curve was 0.830 and the DNN score remained independently predictive of diabetes. The performance of the DNN in the clinic cohort was similar to that in other validation datasets. There was a statistically-significant and positive association between the continuous DNN score and hemoglobin A1c (p ≤ 0.001) among those with hemoglobin A1c data. These findings demonstrate that smartphone-based photoplethysmography provides a readily attainable, non-invasive digital biomarker of prevalent diabetes.
2020-bao.pdf: “A map of object space in primate inferotemporal cortex”, Pinglei Bao, Liang She, Mason McGill, Doris Y. Tsao (2020-06-03):
The inferotemporal (IT) cortex is responsible for object recognition, but it is unclear how the representation of visual objects is organized in this part of the brain. Areas that are selective for categories such as faces, bodies, and scenes have been found1,2,3,4,5, but large parts of IT cortex lack any known specialization, raising the question of what general principle governs IT organization. Here we used functional MRI, microstimulation, electrophysiology, and deep networks to investigate the organization of the macaque IT cortex. We built a low-dimensional object space to describe general objects using a feedforward deep neural network trained on object classification6. Responses of IT cells to a large set of objects revealed that single IT cells project incoming objects onto specific axes of this space. Anatomically, cells were clustered into four networks according to the first two components of their preferred axes, forming a map of object space. This map was repeated across three hierarchical stages of increasing view invariance, and cells that comprised these maps collectively harboured sufficient coding capacity to approximately reconstruct objects. These results provide a unified picture of IT organization in which category-selective regions are part of a coarse map of object space whose dimensions can be extracted from a deep network.
2020-barshai.pdf: “Identifying Regulatory Elements via Deep Learning”, Mira Barshai, Eitamar Tripto, Yaron Orenstein (2020-07):
Deep neural networks have been revolutionizing the field of machine learning for the past several years. They have been applied with great success in many domains of the biomedical data sciences and are outperforming extant methods by a large margin. The ability of deep neural networks to pick up local image features and model the interactions between them makes them highly applicable to regulatory genomics. Instead of an image, the networks analyze DNA and RNA sequences and additional epigenomic data. In this review, we survey the successes of deep learning in the field of regulatory genomics. We first describe the fundamental building blocks of deep neural networks, popular architectures used in regulatory genomics, and their training process on molecular sequence data. We then review several key methods in different gene regulation domains. We start with the pioneering method DeepBind and its successors, which were developed to predict protein–DNA binding. We then review methods developed to predict and model epigenetic information, such as histone marks and nucleosome occupancy. Following epigenomics, we review methods to predict protein–RNA binding with its unique challenge of incorporating RNA structure information. Finally, we provide our overall view of the strengths and weaknesses of deep neural networks and prospects for future developments.
2020-bell.pdf#facebook: “GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce”, Sean Bell, Yiqun Liu, Sami Alsheikh, Yina Tang, Ed Pizzi, M. Henning, Karun Singh, Omkar Parkhi, Fedor Borisyuk (2020-08-22):
In this paper, we present GrokNet, a deployed image recognition system for commerce applications. GrokNet leverages a multi-task learning approach to train a single computer vision trunk. We achieve a 2.1× improvement in exact product match accuracy when compared to the previous state-of-the-art Facebook product recognition system. We achieve this by training on 7 datasets across several commerce verticals, using 80 categorical loss functions and 3 embedding losses. We share our experience of combining diverse sources with wide-ranging label semantics and image statistics, including learning from human annotations, user-generated tags, and noisy search engine interaction data. GrokNet has demonstrated gains in production applications and operates at Facebook scale.
2020-chen.pdf#openai: “iGPT: Generative Pretraining from Pixels”, Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever (OpenAI) (2020-06-17):
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
[See also Image Transformer.]
2020-cowen.pdf: “Sixteen facial expressions occur in similar contexts worldwide”, Alan S. Cowen (2020-12-16):
Understanding the degree to which human facial expressions co-vary with specific social contexts across cultures is central to the theory that emotions enable adaptive responses to important challenges and opportunities. Concrete evidence linking social context to specific facial expressions is sparse and is largely based on survey-based approaches, which are often constrained by language and small sample sizes. Here, by applying machine-learning methods to real-world, dynamic behaviour, we ascertain whether naturalistic social contexts (for example, weddings or sporting competitions) are associated with specific facial expressions across different cultures. In two experiments using deep neural networks, we examined the extent to which 16 types of facial expression occurred systematically in thousands of contexts in 6 million videos from 144 countries. We found that each kind of facial expression had distinct associations with a set of contexts that were 70% preserved across 12 world regions. Consistent with these associations, regions varied in how frequently different facial expressions were produced as a function of which contexts were most salient. Our results reveal fine-grained patterns in human facial expressions that are preserved across the modern world.
2020-hasson.pdf: “Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks”, Uri Hasson, Samuel A. Nastase, Ariel Goldstein (2020-02-05):
Evolution is a blind fitting process by which organisms become adapted to their environment. Does the brain use similar brute-force fitting processes to learn how to perceive and act upon the world? Recent advances in artificial neural networks have exposed the power of optimizing millions of synaptic weights over millions of observations to operate robustly in real-world contexts. These models do not learn simple, human-interpretable rules or representations of the world; rather, they use local computations to interpolate over task-relevant manifolds in a high-dimensional parameter space. Counterintuitively, similar to evolutionary processes, over-parameterized models can be simple and parsimonious, as they provide a versatile, robust solution for learning a diverse set of functions. This new family of direct-fit models present a radical challenge to many of the theoretical assumptions in psychology and neuroscience. At the same time, this shift in perspective establishes unexpected links with developmental and ecological psychology. [Keywords: evolution, experimental design, interpolation, learning, neural networks]
2020-hernandezorallo.pdf: “Twenty Years Beyond the Turing Test: Moving Beyond the Human Judges Too”, José Hernández-Orallo (2020-11-04):
In the last 20 years the Turing test has been left further behind by new developments in artificial intelligence. At the same time, however, these developments have revived some key elements of the Turing test: imitation and adversarialness. On the one hand, many generative models, such as generative adversarial networks (GAN), build imitators under an adversarial setting that strongly resembles the Turing test (with the judge being a learnt discriminative model). The term “Turing learning” has been used for this kind of setting. On the other hand, AI benchmarks are suffering an adversarial situation too, with a ‘challenge-solve-and-replace’ evaluation dynamics whenever human performance is ‘imitated’. The particular AI community rushes to replace the old benchmark by a more challenging benchmark, one for which human performance would still be beyond AI. These two phenomena related to the Turing test are sufficiently distinctive, important and general for a detailed analysis. This is the main goal of this paper. After recognising the abyss that appears beyond superhuman performance, we build on Turing learning to identify two different evaluation schemas: Turing testing and adversarial testing. We revisit some of the key questions surrounding the Turing test, such as ‘understanding’, commonsense reasoning and extracting meaning from the world, and explore how the new testing paradigms should work to unmask the limitations of current and future AI. Finally, we discuss how behavioural similarity metrics could be used to create taxonomies for artificial and natural intelligence. Both testing schemas should complete a transition in which humans should give way to machines—not only as references to be imitated but also as judges—when pursuing and measuring machine intelligence.
2020-jouppi.pdf#google: “A domain-specific supercomputer for training deep neural networks”, Norman P. Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, David Patterson (2020-06-01):
Google’s TPU supercomputers train deep neural networks 50× faster than general-purpose supercomputers running a high-performance computing benchmark. [See also “The Design Process for Google’s Training Chips: TPUv2 and TPUv3”, Norrie et al 2021]
2020-kreps.pdf: “All the News That's Fit to Fabricate: AI-Generated Text as a Tool of Media Misinformation”, Sarah Kreps, R. Miles McCain, Miles Brundage (2020-11-20):
Online misinformation has become a constant; only the way actors create and distribute that information is changing. Advances in artificial intelligence (AI) such as GPT-2 mean that actors can now synthetically generate text in ways that mimic the style and substance of human-created news stories. We carried out three original experiments to study whether these AI-generated texts are credible and can influence opinions on foreign policy. The first evaluated human perceptions of AI-generated text relative to an original story. The second investigated the interaction between partisanship and AI-generated news. The third examined the distributions of perceived credibility across different AI model sizes. We find that individuals are largely incapable of distinguishing between AI-generated and human-generated text; partisanship affects the perceived credibility of the story; and exposure to the text does little to change individuals’ policy views. The findings have important implications in understanding AI in online misinformation campaigns. [Keywords: misinformation, disinformation, foreign policy, public opinion, media]
2020-launay-2.pdf: “Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures”, Julien Launay, Iacopo Poli, François Boniface, Florent Krzakala (2020-10-22):
Despite being the workhorse of deep learning, the backpropagation algorithm is no panacea. It enforces sequential layer updates, thus preventing efficient parallelization of the training process. Furthermore, its biological plausibility is being challenged. Alternative schemes have been devised; yet, under the constraint of synaptic asymmetry, none have scaled to modern deep learning tasks and architectures. Here, we challenge this perspective, and study the applicability of Direct Feedback Alignment (DFA) to neural view synthesis, recommender systems, geometric learning, and natural language processing. In contrast with previous studies limited to computer vision tasks, our findings show that it successfully trains a large range of state-of-the-art deep learning architectures, with performance close to fine-tuned backpropagation. When a larger gap between DFA and backpropagation exists, like in Transformers, we attribute this to a need to rethink common practices for large and complex architectures. At variance with common beliefs, our work supports that challenging tasks can be tackled in the absence of weight transport.
2020-leipheimer.pdf: “First-in-human evaluation of a hand-held automated venipuncture device for rapid venous blood draws”, Josh M. Leipheimer, Max L. Balter, Alvin I. Chen, Enrique J. Pantin, Alexander E. Davidovich, Kristen S. Labazzo, Martin L. Yarmush (2020-01-22):
Obtaining venous access for blood sampling or intravenous (IV) fluid delivery is an essential first step in patient care. However, success rates rely heavily on clinician experience and patient physiology. Difficulties in obtaining venous access result in missed sticks and injury to patients, and typically require alternative access pathways and additional personnel that lengthen procedure times, thereby creating unnecessary costs to healthcare facilities.
Here, we present the first-in-human assessment of an automated robotic venipuncture device designed to safely perform blood draws on peripheral forearm veins. The device combines ultrasound imaging and miniaturized robotics to identify suitable vessels for cannulation and robotically guide an attached needle toward the lumen center. The device demonstrated results comparable to or exceeding that of clinical standards, with a success rate of 87% on all participants (n = 31), a 97% success rate on non-difficult venous access participants (n = 25), and an average procedure time of 93 ± 30 s (n = 31).
In the future, this device can be extended to other areas of vascular access such as IV catheterization, central venous access, dialysis, and arterial line placement. [Keywords: medical device, robotics, image-guidance, ultrasound, vascular access, computer vision, machine learning]
2020-lillicrap.pdf: “Backpropagation and the brain”, Timothy P. Lillicrap, Adam Santoro, Luke Marris, Colin J. Akerman, Geoffrey Hinton (2020-04-17):
During learning, the brain modifies synapses to improve behaviour. In the cortex, synapses are embedded within multilayered networks, making it difficult to determine the effect of an individual synaptic modification on the behaviour of the system. The backpropagation algorithm solves this problem in deep artificial neural networks, but historically it has been viewed as biologically problematic. Nonetheless, recent developments in neuroscience and the successes of artificial neural networks have reinvigorated interest in whether backpropagation offers insights for understanding learning in the cortex. The backpropagation algorithm learns quickly by computing synaptic updates using feedback connections to deliver error signals. Although feedback connections are ubiquitous in the cortex, it is difficult to see how they could deliver the error signals required by strict formulations of backpropagation. Here we build on past and recent developments to argue that feedback connections may instead induce neural activities whose differences can be used to locally approximate these signals and hence drive effective learning in deep networks in the brain.
2020-su.pdf: “Avatar Artist Using GAN [CS230]”, Hui Su, Jin Fang (2020-04-12):
Human sketches can be expressive and abstract at the same time. Generating anime avatars from simple or even bad face drawing is an interesting area. Lots of related work has been done such as auto-coloring sketches to anime or transforming real photos to anime. However, there aren’t many interesting works yet to show how to generate anime avatars from just some simple drawing input. In this project, we propose using GAN to generate anime avatars from sketches.
2020-thompson.pdf: “Cultural influences on word meanings revealed through large-scale semantic alignment”, Bill Thompson, Seán G. Roberts, Gary Lupyan (2020-08-10):
If the structure of language vocabularies mirrors the structure of natural divisions that are universally perceived, then the meanings of words in different languages should closely align. By contrast, if shared word meanings are a product of shared culture, history and geography, they may differ between languages in substantial but predictable ways. Here, we analysed the semantic neighbourhoods of 1,010 meanings in 41 languages. The most-aligned words were from semantic domains with high internal structure (number, quantity and kinship). Words denoting natural kinds, common actions and artefacts aligned much less well. Languages that are more geographically proximate, more historically related and/
or spoken by more-similar cultures had more aligned word meanings. These results provide evidence that the meanings of common words vary in ways that reflect the culture, history and geography of their users.
2020-vazquezguardado.pdf: “Recent advances in neurotechnologies with broad potential for neuroscience research”, Abraham Vázquez-Guardado, Yiyuan Yang, Amay J. Bandodkar, John A. Rogers (2020-11-16):
Interest in deciphering the fundamental mechanisms and processes of the human mind represents a central driving force in modern neuroscience research. Activities in support of this goal rely on advanced methodologies and engineering systems that are capable of interrogating and stimulating neural pathways, from single cells in small networks to interconnections that span the entire brain. Recent research establishes the foundations for a broad range of creative neurotechnologies that enable unique modes of operation in this context. This review focuses on those systems with proven utility in animal model studies and with levels of technical maturity that suggest a potential for broad deployment to the neuroscience community in the relatively near future. We include a brief summary of existing and emerging neuroscience techniques, as background for a primary focus on device technologies that address associated opportunities in electrical, optical and microfluidic neural interfaces, some with multimodal capabilities. Examples of the use of these technologies in recent neuroscience studies illustrate their practical value. The vibrancy of the engineering science associated with these platforms, the interdisciplinary nature of this field of research and its relevance to grand challenges in the treatment of neurological disorders motivate continued growth of this area of study.
2021-gangadharbatla.pdf: “The Role of AI Attribution Knowledge in the Evaluation of Artwork”, Harsha Gangadharbatla (2021-02-16):
Artwork is increasingly being created by machines through algorithms with little or no input from humans. Yet, very little is known about people’s attitudes and evaluations of artwork generated by machines. The current study investigates (a) whether individuals are able to accurately differentiate human-made artwork from AI-generated artwork and (b) the role of attribution knowledge (i.e., information about who created the content) in their evaluation and reception of artwork. Data was collected using an Amazon Turk sample from two survey experiments designed on Qualtrics. Findings suggest that individuals are unable to accurately identify AI-generated artwork and they are likely to associate representational art to humans and abstract art to machines. There is also an interaction effect between attribution knowledge and the type of artwork (representational vs. abstract) on purchase intentions and evaluations of artworks.
2021-norrie.pdf#google: “The Design Process for Google's Training Chips: TPUv2 and TPUv3”, Thomas Norrie, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li, James Laudon, Cliff Young, Norman Jouppi, David Patterson (2021):
Five years ago, few would have predicted that a software company like Google would build its own computers. Nevertheless, Google has been deploying computers for machine learning (ML) training since 2017, powering key Google services. These Tensor Processing Units (TPUs) are composed of chips, systems, and software, all co-designed in-house. In this paper, we detail the circumstances that led to this outcome, the challenges and opportunities observed, the approach taken for the chips, a quick review of performance, and finally a retrospective on the results. A companion paper describes the supercomputers built from these chips, the compiler, and a detailed performance analysis [Jou20].