Automatic font image synthesis has been an extremely active topic in recent years. Various deep learning-based approaches have been proposed to tackle this font synthesis task by considering it as an image-to-image translation problem in a supervised setting. However, all such approaches mainly focus on one-to-one font mapping, i.e., synthesizing a single font style, making it difficult to handle more practical problems such as the font family synthesis, which is a one-to-many mapping problem. Moreover, this font family synthesis is more challenging because it is an unsupervised image-to-image translation problem, i.e., no paired dataset is available during training.
To address this font family synthesis problem, we propose a method that utilizes a single generator to conditionally produce various font family styles to form a font family. To the best of our knowledge, our proposed method is the first to synthesize a font family (multiple font styles belonging to a font), instead of synthesizing a single font style. More specifically, our method is trained to learn a font family by conditioning on various styles, e.g., normal, bold, italic, bold-italic, etc. After training, given an unobserved single font style (normal style font as an input), our method can successfully synthesize the remaining styles (e.g., bold, italic, bold-italic, etc.) to complete the font family.
Qualitative and quantitative experiments were conducted to demonstrate the effectiveness of our proposed method.
Convolutional Neural Networks (CNNs) can perform similarly or better than standard genomic prediction methods when sufficient genetic, environmental, and management data are provided.
Predicting phenotypes from genetic (G), environmental (E), and management (M) conditions is a long-standing challenge with implications for agriculture, medicine, and conservation. Most methods reduce the factors in a dataset (feature engineering) in a subjective and potentially oversimplified manner. Deep neural networks such as Multilayer Perceptrons (MLP) and Convolutional Neural Networks (CNN) can overcome this by allowing the data itself to determine which factors are most important. CNN models were developed for predicting agronomic yield from a combination of replicated trials and historical yield survey data. The results were more accurate than standard methods when tested on held-out G, E, and M data (r = 0.50 vs. r = 0.43), and performed slightly worse than standard methods when only G was held out (r = 0.74 vs. r = 0.80). Pre-training on historical data increased accuracy compared to trial data alone. Saliency map analysis indicated the CNN has “learned” to prioritize many factors of known agricultural importance.
Background: Technology to restore the ability to communicate in paralyzed persons who cannot speak has the potential to improve autonomy and quality of life. An approach that decodes words and sentences directly from the cerebral cortical activity of such patients may represent an advancement over existing methods for assisted communication.
Methods: We implanted a subdural, high-density, multielectrode array over the area of the sensorimotor cortex that controls speech in a person with anarthria (the loss of the ability to articulate speech) and spastic quadriparesis caused by a brain-stem stroke. Over the course of 48 sessions, we recorded 22 hours of cortical activity while the participant attempted to say individual words from a vocabulary set of 50 words. We used deep-learning algorithms to create computational models for the detection and classification of words from patterns in the recorded cortical activity. We applied these computational models, as well as a natural-language model that yielded next-word probabilities given the preceding words in a sequence, to decode full sentences as the participant attempted to say them.
Results: We decoded sentences from the participant’s cortical activity in real time at a median rate of 15.2 words per minute, with a median word error rate of 25.6%. In post hoc analyses, we detected 98% of the attempts by the participant to produce individual words, and we classified words with 47.1% accuracy using cortical signals that were stable throughout the 81-week study period.
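As a rough illustration of the word error rate reported above, the sketch below computes the conventional definition: word-level edit distance (substitutions, insertions, deletions) divided by the number of words in the reference sentence. The function name and the example sentences are hypothetical, and the study's exact scoring procedure is not described in this abstract, so this should be read as the standard metric rather than the authors' implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or a match if cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the study's 50-word vocabulary.
    print(word_error_rate("i am very thirsty", "i am thirsty"))
```

Because insertions count as errors, this ratio can exceed 1 when the decoded sentence is much longer than the reference.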
Conclusions: In a person with anarthria and spastic quadriparesis caused by a brain-stem stroke, words and sentences were decoded directly from cortical activity during attempted speech with the use of deep-learning models and a natural-language model. (Funded by Facebook and others; ClinicalTrials.gov number, NCT03698149.)
Modern neural text generation systems can produce remarkably fluent and grammatical texts. While earlier language models suffered from repetition and syntactic errors, the errors made by contemporary models are often semantic, narrative, or discourse failures.
To facilitate research of these complex error types, we introduce a new structured, crowdsourced error annotation schema called Scarecrow. The error categories used in Scarecrow—such as redundancy, commonsense errors, and incoherence—were identified by combining expert analysis with several pilot rounds of ontology-free crowd annotation to arrive at a schema which covers the error phenomena found in real machine generated text.
We use Scarecrow to collect 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text, amounting to over 41k spans, each labeled with its error category, severity, a natural language explanation, and antecedent span (where relevant). We collect annotations for text generated by state-of-the-art systems with varying known performance levels, from GPT-2-small through the largest GPT-3-175b. We isolate several factors for detailed analysis, including parameter count, training data, and decoding technique.
Our results show both expected and surprising differences across these settings. These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems. We release our complete annotation toolkit and dataset at GitHub.
Scaling pays off to improve Encyclopedic, Commonsense, and Incoherent errors (Figure 2).
These error categories decrease with in-domain training (GROVER) and larger model size (GPT-3). Human text still shows the fewest of these kinds of errors.
Scaling benefits plateau for Off-Prompt, Bad Math, and Grammar & Usage errors (Figure 2).
These 3 error categories see a model plateau in error reduction when scaling to GPT-3. Of these error types, humans still commit fewer Off-Prompt (more: §6.1) and Grammar & Usage errors, but Bad Math appears saturated for our domain.
Self-Contradiction and Redundant errors exhibit more complex scaling behavior (Figure 2).
We roughly categorize these trends as rising and falling: increasing for medium or large-scale models, but dropping for human-authored text. Further analysis (§6.2, §6.3) reveals these more complex patterns are affected both by interactions with other error types and by how errors are counted.
Human-authored text produces the most reader issues (Figure 2–3).
The Needs Google and Technical Jargon span categories both follow a “humans highest” trend, and both fall under reader issues: problems that are not necessarily errors, but that still prevent full comprehension or factual verification of the text (more: §6.4).
Furthermore, human-authored text is not free from error annotations (Figure 3). This can serve either as a control for baseline error rates (more: §6.6), or as a mechanism for critiquing human writing.
Decoding hyperparameters have a huge impact (Figure).
For the previous findings, we fix the sampling configuration for all models to an apples-to-apples setup for fair comparison: top-p = 0.96, (softmax) temperature = 1, and no frequency penalty (i.e., word repetition penalty; defined precisely in §5.2, Equation 1). To study the effects of these decoding settings, we annotate text generated by GPT-3 using a variety of values for top-p and temperature, both with and without a frequency penalty.
To our surprise, the decoding hyperparameters considerably affected error rates (more: §6.5). As seen in Figure 4, the worst sampling procedure for GPT-3 (argmax sampling with no frequency penalty) performed even worse than GPT-2 XL. But the best sampling procedure (surprisingly, also argmax sampling, but with a frequency penalty) produced text with as few apparent SCARECROW error spans as those authored by humans (more: §6.6).
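The decoding settings discussed above (top-p, temperature, frequency penalty, and argmax sampling) can be sketched roughly as follows. This is a minimal illustration, not the paper's or any API's implementation. In particular, the frequency penalty used here (subtracting `frequency_penalty` times each token's prior count from its logit) is an assumed form; the paper's precise definition is its Equation 1 in §5.2, which is not reproduced in this excerpt. The function names, the `greedy` flag, and the toy logits are all hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose total mass
    reaches top_p, zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]          # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()


def sample_next(logits, generated_ids, rng, top_p=0.96, temperature=1.0,
                frequency_penalty=0.0, greedy=False):
    """Pick the next token id from raw logits.

    ASSUMPTION: the frequency penalty is subtractive and proportional to
    how often each token already appears in generated_ids.
    """
    logits = np.asarray(logits, dtype=float).copy()
    counts = np.bincount(np.asarray(generated_ids, dtype=int),
                         minlength=logits.size)
    logits -= frequency_penalty * counts
    if greedy:                               # "argmax sampling"
        return int(np.argmax(logits))
    probs = top_p_filter(softmax(logits / temperature), top_p)
    return int(rng.choice(logits.size, p=probs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_logits = [2.0, 1.9, 0.0]             # hypothetical 3-token vocabulary
    history = [0, 0]                         # token 0 was already emitted twice
    print(sample_next(toy_logits, history, rng, greedy=True))
    print(sample_next(toy_logits, history, rng, greedy=True,
                      frequency_penalty=0.5))
```

In the toy example, greedy decoding without a penalty keeps choosing the already-repeated token, while the penalized variant switches to the runner-up. This is the kind of repetition effect that the frequency penalty is meant to suppress.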
…We notice that a greater portion of errors in human-authored text were due to artifacts present in the text-only format of the Common Crawl. For example, links to other articles or advertisements sometimes appear in the middle of an article’s text. While annotators were quick to mark these spans, they reflect errors in formatting, not in writing. We partition these errors separately and exclude them from the subsequent calculations. GPT-3’s generations also sometimes exhibited what appeared to be formatting errors due to training on web-scraped text, though more rarely. For example, some generations contained Which? after vague noun phrases, which appear to be learned from Wikipedia, where under-specified information is tagged by an editor with this word. For fairness, we removed these errors from GPT-3’s tally as well, though they were few enough we do not plot them separately.
2021-jouppi.pdf: “Ten Lessons From Three Generations Shaped Google’s TPUv4i”, Norman P. Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, David Patterson (2021-06-14):
Google deployed several TPU generations since 2015, teaching us lessons that changed our views: semiconductor technology advances unequally; compiler compatibility trumps binary compatibility, especially for VLIW domain-specific architectures (DSAs); target total cost of ownership vs initial cost; support multi-tenancy; deep neural networks (DNN) grow 1.5× annually; DNN advances evolve workloads; some inference tasks require floating point; inference DSAs need air-cooling; apps limit latency, not batch size; and backwards ML compatibility helps deploy DNNs quickly. These lessons molded TPUv4i, an inference DSA deployed since 2020.
Document the unequal improvement in logic, wires, SRAM, and DRAM from 45 nm to 7 nm (including an update of Horowitz’s operation energy table [16] from 45 nm to 7 nm) and show how these changes led to 4 systolic floating point matrix units for TPUv4i in 2020 versus one systolic integer matrix unit for TPUv1 in 2015.
Explain the difference between designing for performance per TCO vs per CapEx, leading to HBM and a low TDP for TPUv4i, and show how TPUv1’s headroom led to application scaleup after the 2017 paper [21].
Explain backwards ML compatibility, including why inference can need floating point and how it spurred the TPUv4i and TPUv4 designs (§3). Backwards ML compatible training also tailors DNNs to TPUv4i (§2).
Measure production inference applications to show that DSAs normally run multiple DNNs concurrently, requiring Google inference DSAs to support multi-tenancy.
Discuss how DNN advances change the production inference workload. The 2020 workload keeps MLP and CNN from 2017 but adds BERT, and RNN succeeds LSTM.
Document the growth of production DNNs in memory size and computation by ~1.5× annually since 2016, which encourages designing DSAs with headroom.
Show that Google’s TCO and TDP for DNN DSAs are strongly correlated (r = 0.99), likely due to the end of Dennard scaling. TDP offers a good proxy for DSA TCO.
Document that the SLO limit is P99 time for inference applications, list typical batch sizes, and show how large on-chip SRAM helps P99 performance.
Explain why TPUv4i architects chose compiler compatibility over binary compatibility for its VLIW ISA.
Describe Google’s latest inference accelerator in production since March 2020 and evaluate its performance/TDP vs. TPUv3 and NVIDIA’s T4 inference GPU using production apps and MLPerf Inference benchmarks 0.5–0.7.
…TPUv1 required quantization—since it supported only integer arithmetic—which proved a problem for some datacenter applications. Early in TPUv1 development, application developers said a 1% drop in quality was acceptable, but they changed their minds by the time the hardware arrived, perhaps because DNN overall quality improved so that 1% added to a 40% error was relatively small but 1% added to a 12% error was relatively large.
…Alas, DNN DSA designers often ignore multi-tenancy. Indeed, multi-tenancy is not mentioned in the TPUv1 paper [21]. (It was lucky that the smallest available DDR3 DRAM held 8GB, allowing TPUv1 software to add multi-tenancy.)
…BERT appeared in 2018, yet it’s already 28% of the workload.
Modern law enforcement agencies strive to identify current trends and developments in Darknet markets. Extracting information from such markets requires knowledge about the contained entities, which can be extracted via Named Entity Recognition (NER).
Modern NER models are trained via supervised learning, which requires an annotated dataset, but such datasets for specific application domains, e.g. drug detection in Darknet markets, are rarely available. In this work, we created a NER dataset focused on drugs in Darknet markets and evaluated resources and techniques for domain and task adaptation of our NER models. The dataset, with about 3,500 item listings, was created via crowd-sourcing and refined via a manual review. It is approximately 4 times the size of the only other NER dataset for Darknet markets that we were aware of at the time.
We found that we were able to improve our NER prediction performance by ‘domain adaptation’ via fine-tuning our language models on Darknet item descriptions and reduced versions of Wikipedia texts about illicit drugs. Our models were able to predict drug entities with an F1-score of up to 84.04 points according to the CoNLL2003 NER evaluation metric.
[Keywords: NER, Named Entity Recognition, noisy user-generated text, darknet, drug detection, crowd-sourcing, Mechanical Turk]
…The Darknet data is loaded from 2 primary sources, the Darknet Market Archives [BCDH+15] and AZSecure-data [DZE+18].
The Darknet Market Archives contain multiple datasets about Darknet Market platforms and forums. We only used the “grams” dataset. This dataset contains nearly daily scrapes of multiple market platforms (e.g. “Agora”). We chose to use the last date on which these markets were scraped (“2015-07-12”) and only a subset of these markets, namely: “Abraxas”, “Agora”, “Alpha”, “ME”, and “Oxygen”. This dataset was only used for adjusting our language models to the target domain, called domain adaptation (see section 2.1). For the dataset creation we used a dataset from AZSecure-data, which was scraped from a platform called “Dream Market”. At this time it was the largest Darknet market platform according to [DZE+18]. The data was collected from 2013 to 2017 and contained 91,463 listings of which 61,420 were found in a category associated with drugs. The dataset contains a variety of product and vendor information.
Within the scope of this work, we were only interested in the product name and description. The item description was used for the annotation of named entities, and the product name was used to provide context to the annotators. However, other types of information were used during the pre-processing for pseudonymization purposes. The pseudonymization included removing all vendor names from the item listings, removing email addresses and telephone numbers and all links found in the dataset (those might also identify a vendor profile). A recent example of a drug item listing, which was online at the time of our project, can be seen in Figure 3.1.
Our experiment design required further datasets as representatives for standard NER corpora and text corpora with noisy user-generated data. Our standard NER text corpus is the well-known CoNLL2003 NER dataset [TKSDM03], which is based on newswire texts annotated with Person, Location, Organization and Miscellaneous entities. As representatives for the noisy user-generated text datasets we chose the Broad Twitter Corpus [DBR16] and the WNUT 2017 dataset [DNEL17]. The Broad Twitter Corpus contains 9,551 Tweets with annotations for entities of type Person, Location and Organization. The WNUT 2017 dataset contains 2,295 texts from various sources (Reddit, Twitter, YouTube, and StackExchange comments) with annotations for Person, Location, Corporation, Product, Creative-Work and Group as named entity types. Furthermore, we used the extension from Al-Nabki [NFAFR20] of the WNUT 2017 dataset called “NuToT”. This dataset version is extended by Darknet market listings, which advertise illicit goods.
The People’s Liberation Army (PLA) seeks not only to equal but also to overtake the US military through seizing the initiative in the ongoing Revolution in Military Affairs (RMA). Chinese military leaders believe the form of warfare is changing from today’s ‘informatised’ (信息化) warfare to future ‘intelligentised’ (智能化) warfare.
The PLA’s approach to leveraging emerging technologies is likely to differ from parallel American initiatives because of its distinct strategic culture, organisational characteristics, and operational requirements. This research examines the evolution of the PLA’s strategic thinking and concepts of operations, seeking to contribute to the military innovation literature by evaluating major theoretical frameworks for the case of China.
The widespread use of experimental benchmarks in AI research has created competition and collaboration dynamics that are still poorly understood. Here we provide an innovative methodology to explore these dynamics and analyse the way different entrants in these challenges, from academia to tech giants, behave and react depending on their own or others’ achievements.
We perform an analysis of 25 popular benchmarks in AI from Papers With Code [HMDB-51 · UCF101 · Montezuma’s Revenge · Space Invaders · CIFAR-100 · ImageNet · CIFAR-10 · Set5 · Enwik8 · Penn Treebank · WN18RR · WMT2014 English-French · WMT2014 English-German · CoNLL 2003 · Ontonotes v5 · COCO Minival · COCO test-dev · MPII Human Pose · SQuAD1.1 · WikiQA · Cityscapes test · PASCAL VOC 2012 test · IMD · SST-2 Binary classification · LibriSpeech], with around 2,000 result entries overall, connected with their underlying research papers. We identify links between researchers and institutions (that is, communities) beyond the standard co-authorship relations, and we explore a series of hypotheses about their behaviour as well as some aggregated results in terms of activity, performance jumps and efficiency. We characterize the dynamics of research communities at different levels of abstraction, including organization, affiliation, trajectories, results and activity.
We find that hybrid, multi-institution and persevering communities are more likely to improve state-of-the-art performance, which becomes a watershed for many community members.
Although the results cannot be extrapolated beyond our selection of popular machine learning benchmarks, the methodology can be extended to other areas of artificial intelligence or robotics, and combined with bibliometric studies.
…Figure 1 represents the results for the ImageNet dataset, which consists of 1.2 million images in 1,000 classes. The results of the different communities show that several long-term collaborative ‘hybrid’ groups, formed mostly by American universities (Johns Hopkins, University of California, Los Angeles, Cornell, Stanford, Toronto and so on) in collaboration with tech giants (Microsoft and Google) are those that have dominated the SOTA front from early on (communities numbered as #1 and #2). Although hybrid communities dominate the SOTA front, there are also some isolated company players, possibly representing different divisions, departments and research groups from companies such as Google, Xiaomi, Facebook and Microsoft. However, only a single non-hybrid community, Google, is able to achieve a score on the SOTA front.
…While all benchmark plots can be found in the Supplementary Information (Supplementary Figs. 4–6), we include another example here, in Figure 2. This is the Stanford Question Answering Dataset (SQuAD1.1), a reading comprehension benchmark with more than 100,000 question–answer pairs from more than 500 articles. Questions derive from Wikipedia articles where the answer may be a segment of text from the corresponding reading passage, or may be unanswerable (for example, written adversarially to look similar to answerable ones). Like ImageNet, the SOTA front is dominated by hybrid long-term collaboration groups (communities numbered #2 and #3) formed by American universities (Stanford, Carnegie Mellon, Washington and so on) in collaboration with tech giants (Facebook and Google), but also by large hybrid communities formed by Asian universities (Beihang, Fudan, Peking and so on) jointly with Microsoft (community #1). We also observe that the participation of European university initiatives is very low. Unlike ImageNet, most entries correspond to the period between 2016 and 2018, with a clear decline in activity from 2018 to 2020. This is probably due to the introduction of the new (and much more difficult) version of the benchmark (SQuAD2.0), with attention moving to the new challenge. However, SQuAD1.1 is still being addressed by communities 2 and 3, which have participated since 2016 and have led the SOTA from 2018 to 2020. Again, we see that long-term collaborative groups obtain better results than isolated communities.
…From the above results, we reach a number of conclusions about the dynamics of communities engaging with AI benchmarks. We find that (1) SOTA jumps are mainly obtained by multi-institution communities, compared with the number of jumps obtained by single-institution communities; (2) multi-attempt communities are more likely to achieve SOTA jumps (compared with one-shot efforts); (3) jumps are mainly obtained by hybrid communities involving both universities and companies, meaning that heterogeneous communities achieve more success through collaborative efforts compared with ‘pure’ communities (only universities or companies); and, finally, (4) the presence of companies in a community, such as Google, Microsoft and Facebook, increases the odds of achieving a jump in an AI benchmark. All the above reinforces the usefulness of the increasing tendency of collaboration between universities and industry in AI research.
…While institutions from the United States represent about 56.7% of all jumps, China only represents about 18%. However, the gap becomes smaller if we only consider the recent years. For instance, Table 3 (orange) shows the same results for year 2019 only. Here the institutions from Asia come at the forefront in terms of activity compared with those from America. At the country level, activities from the United States and China are much more similar (41% versus 37%) and although the United States keeps leading the chart with respect to the number of SOTA jumps compared with China (54% versus 26%), the difference has narrowed. This country-level concentration is also reflected when we compute the (country-wise) HHI to analyse concentration and competitiveness. In this case, the HHI is 0.33, showing a much higher concentration level per country compared with the analysis per institution.
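For readers unfamiliar with the HHI (Herfindahl–Hirschman index) mentioned above, it is the sum of squared shares, so on a 0-to-1 scale it equals 1/N when N participants hold equal shares and 1 when a single participant holds everything. The sketch below is a generic illustration: the function name and the counts are made up, and the excerpt does not state exactly which quantity (for example, SOTA jumps per country) the authors used as the share.

```python
def herfindahl_hirschman_index(counts):
    """Sum of squared shares, on a 0-to-1 scale.

    `counts` is any list of non-negative totals per participant
    (e.g. per country or per institution); shares are counts / total.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return sum((c / total) ** 2 for c in counts)


if __name__ == "__main__":
    # Hypothetical counts, NOT the paper's data.
    print(herfindahl_hirschman_index([1, 1, 1, 1]))   # four equal participants
    print(herfindahl_hirschman_index([6, 2, 1, 1]))   # one dominant participant
```

Under this definition a value such as 0.33 corresponds to the concentration of roughly three equally sized participants, which is consistent with the text's reading of it as a high country-level concentration.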
These results are loosely consistent with analyses framing AI research progress as a ‘race’ being led by the United States and China…The data we analyse here represent only a small snapshot of all AI progress, but it still suggests that the United States has had a relevant lead if we look at the whole period, but the gap is being reduced by other countries such as China (Table 3). In the whole period, as Figure 3 shows, 6 out of the top 10 institutions are from the United States (the top 3 being tech giants).
[Poster] In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail.
In some situations we show that neural networks learn through a process of grokking a pattern in the data, improving generalization performance from random-chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting.
We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset.
…We find that adding weight decay has a very large effect on data efficiency, more than halving the number of samples needed compared to most other interventions. We found that weight decay towards the initialization of the network is also effective, but not quite as effective as weight decay towards the origin. This makes us believe that the prior, that approximately zero weights are suitable for small algorithmic tasks, explains part, but not all of the superior performance of weight decay. Adding some noise to the optimization process (e.g., gradient noise from using minibatches, Gaussian noise applied to weights before or after computing the gradients) is beneficial for generalization, consistent with the idea that such noise might induce the optimization to find flatter minima that generalize better. We found that learning rate had to be tuned in a relatively narrow window for the generalization to happen (within 1 order of magnitude).
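The two weight decay variants compared above differ only in the point the weights are pulled toward. A minimal sketch follows, assuming plain SGD with an L2-style penalty; the excerpt does not specify the optimizer or whether the decay is coupled to the gradient, so the function name, signature, and update rule are illustrative assumptions rather than the authors' setup.

```python
import numpy as np


def sgd_step_with_decay(weights, grad, lr, decay, anchor=None):
    """One SGD step with weight decay toward a reference point.

    anchor=None          -> decay toward the origin (standard weight decay)
    anchor=initial value -> decay toward the network's initialization

    ASSUMPTION: the decay term is added to the gradient (L2-penalty form).
    """
    target = np.zeros_like(weights) if anchor is None else anchor
    return weights - lr * (grad + decay * (weights - target))


if __name__ == "__main__":
    w_init = np.array([1.0, -2.0])            # hypothetical initial weights
    zero_grad = np.zeros_like(w_init)         # isolate the decay term

    # Toward the origin: every weight shrinks by a factor (1 - lr * decay).
    print(sgd_step_with_decay(w_init, zero_grad, lr=0.1, decay=1.0))

    # Toward the initialization: weights at their initial values do not move.
    print(sgd_step_with_decay(w_init, zero_grad, lr=0.1, decay=1.0,
                              anchor=w_init))
```

With the gradient set to zero, the first call shrinks the weights toward zero, while the second leaves them unchanged. This shows how decay toward the origin encodes the "approximately zero weights" prior the passage refers to, whereas decay toward the initialization does not.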
Biological cognition is based on self-generated learning objectives. However, the mechanism by which this epistemic autonomy is realized by the neuronal substrate is not understood.
Artificial neural networks based on error backpropagation lack epistemic autonomy because they are mostly trained in a supervised fashion. In this respect, they face the symbol grounding problem of artificial intelligence.
We propose that the entorhinal-hippocampal complex, a brain structure located in the medial temporal lobe and central to memory, combines epistemic autonomy with intrinsically generated error gradients akin to error backpropagation.
We present evidence supporting the hypothesis that the counter-current inhibitory projections of the entorhinal-hippocampal complex implement a continuous self-supervised error minimization between network input and output.
Biological cognition is based on the ability to autonomously acquire knowledge, or epistemic autonomy.
Such self-supervision is largely absent in artificial neural networks (ANN) because they depend on externally set learning criteria. Yet training ANN using error backpropagation has created the current revolution in artificial intelligence, raising the question of whether the epistemic autonomy displayed in biological cognition can be achieved with error backpropagation-based learning.
We present evidence suggesting that the entorhinal-hippocampal complex combines epistemic autonomy with error backpropagation. Specifically, we propose that the hippocampus minimizes the error between its input and output signals through a modulatory counter-current inhibitory network. We further discuss the computational emulation of this principle and analyze it in the context of autonomous cognitive systems.
Artwork is increasingly being created by machines through algorithms with little or no input from humans. Yet, very little is known about people’s attitudes and evaluations of artwork generated by machines. The current study investigates (a) whether individuals are able to accurately differentiate human-made artwork from AI-generated artwork and (b) the role of attribution knowledge (i.e., information about who created the content) in their evaluation and reception of artwork. Data was collected using an Amazon Turk sample from two survey experiments designed on Qualtrics. Findings suggest that individuals are unable to accurately identify AI-generated artwork and they are likely to associate representational art to humans and abstract art to machines. There is also an interaction effect between attribution knowledge and the type of artwork (representational vs. abstract) on purchase intentions and evaluations of artworks.
Five years ago, few would have predicted that a software company like Google would build its own computers. Nevertheless, Google has been deploying computers for machine learning (ML) training since 2017, powering key Google services. These Tensor Processing Units (TPUs) are composed of chips, systems, and software, all co-designed in-house. In this paper, we detail the circumstances that led to this outcome, the challenges and opportunities observed, the approach taken for the chips, a quick review of performance, and finally a retrospective on the results. A companion paper describes the supercomputers built from these chips, the compiler, and a detailed performance analysis [Jou20].
Understanding the degree to which human facial expressions co-vary with specific social contexts across cultures is central to the theory that emotions enable adaptive responses to important challenges and opportunities. Concrete evidence linking social context to specific facial expressions is sparse and is largely based on survey-based approaches, which are often constrained by language and small sample sizes. Here, by applying machine-learning methods to real-world, dynamic behaviour, we ascertain whether naturalistic social contexts (for example, weddings or sporting competitions) are associated with specific facial expressions across different cultures. In two experiments using deep neural networks, we examined the extent to which 16 types of facial expression occurred systematically in thousands of contexts in 6 million videos from 144 countries. We found that each kind of facial expression had distinct associations with a set of contexts that were 70% preserved across 12 world regions. Consistent with these associations, regions varied in how frequently different facial expressions were produced as a function of which contexts were most salient. Our results reveal fine-grained patterns in human facial expressions that are preserved across the modern world.
Online misinformation has become a constant; only the way actors create and distribute that information is changing. Advances in artificial intelligence (AI) such as GPT-2 mean that actors can now synthetically generate text in ways that mimic the style and substance of human-created news stories. We carried out three original experiments to study whether these AI-generated texts are credible and can influence opinions on foreign policy. The first evaluated human perceptions of AI-generated text relative to an original story. The second investigated the interaction between partisanship and AI-generated news. The third examined the distributions of perceived credibility across different AI model sizes. We find that individuals are largely incapable of distinguishing between AI-generated and human-generated text; partisanship affects the perceived credibility of the story; and exposure to the text does little to change individuals’ policy views. The findings have important implications in understanding AI in online misinformation campaigns.
[Keywords: misinformation, disinformation, foreign policy, public opinion, media]
Interest in deciphering the fundamental mechanisms and processes of the human mind represents a central driving force in modern neuroscience research. Activities in support of this goal rely on advanced methodologies and engineering systems that are capable of interrogating and stimulating neural pathways, from single cells in small networks to interconnections that span the entire brain. Recent research establishes the foundations for a broad range of creative neurotechnologies that enable unique modes of operation in this context. This review focuses on those systems with proven utility in animal model studies and with levels of technical maturity that suggest a potential for broad deployment to the neuroscience community in the relatively near future. We include a brief summary of existing and emerging neuroscience techniques, as background for a primary focus on device technologies that address associated opportunities in electrical, optical and microfluidic neural interfaces, some with multimodal capabilities. Examples of the use of these technologies in recent neuroscience studies illustrate their practical value. The vibrancy of the engineering science associated with these platforms, the interdisciplinary nature of this field of research and its relevance to grand challenges in the treatment of neurological disorders motivate continued growth of this area of study.
In the last 20 years the Turing test has been left further behind by new developments in artificial intelligence. At the same time, however, these developments have revived some key elements of the Turing test: imitation and adversarialness. On the one hand, many generative models, such as generative adversarial networks (GAN), build imitators under an adversarial setting that strongly resembles the Turing test (with the judge being a learnt discriminative model). The term “Turing learning” has been used for this kind of setting. On the other hand, AI benchmarks are suffering an adversarial situation too, with a ‘challenge-solve-and-replace’ evaluation dynamics whenever human performance is ‘imitated’. The particular AI community rushes to replace the old benchmark by a more challenging benchmark, one for which human performance would still be beyond AI. These two phenomena related to the Turing test are sufficiently distinctive, important and general for a detailed analysis. This is the main goal of this paper. After recognising the abyss that appears beyond superhuman performance, we build on Turing learning to identify two different evaluation schemas: Turing testing and adversarial testing. We revisit some of the key questions surrounding the Turing test, such as ‘understanding’, commonsense reasoning and extracting meaning from the world, and explore how the new testing paradigms should work to unmask the limitations of current and future AI. Finally, we discuss how behavioural similarity metrics could be used to create taxonomies for artificial and natural intelligence. Both testing schemas should complete a transition in which humans should give way to machines—not only as references to be imitated but also as judges—when pursuing and measuring machine intelligence.
Despite being the workhorse of deep learning, the backpropagation algorithm is no panacea. It enforces sequential layer updates, thus preventing efficient parallelization of the training process. Furthermore, its biological plausibility is being challenged. Alternative schemes have been devised; yet, under the constraint of synaptic asymmetry, none have scaled to modern deep learning tasks and architectures. Here, we challenge this perspective, and study the applicability of Direct Feedback Alignment (DFA) to neural view synthesis, recommender systems, geometric learning, and natural language processing. In contrast with previous studies limited to computer vision tasks, our findings show that it successfully trains a large range of state-of-the-art deep learning architectures, with performance close to fine-tuned backpropagation. When a larger gap between DFA and backpropagation exists, like in Transformers, we attribute this to a need to rethink common practices for large and complex architectures. At variance with common beliefs, our work supports that challenging tasks can be tackled in the absence of weight transport.
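To make the mechanism concrete, the following is a minimal NumPy sketch of Direct Feedback Alignment on a toy regression problem. It is not the authors' setup: the function name `train_dfa`, the synthetic data, the layer sizes, the learning rate, and the feedback scale are all assumptions chosen for illustration. The point it shows is that each hidden layer receives the output error through its own fixed random matrix (`B1`, `B2`) rather than through the transposed forward weights, so no weight transport is needed and the hidden-layer updates do not depend on one another.

```python
import numpy as np


def train_dfa(steps=500, lr=0.05, hidden=16, seed=0):
    """Train a tiny 2-hidden-layer tanh MLP with Direct Feedback Alignment.

    Returns the mean-squared error recorded at the start of every step.
    All sizes and hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical toy data: 256 samples, 4 features, 1 regression target.
    X = rng.standard_normal((256, 4))
    y = np.tanh(X @ rng.standard_normal((4, 1)))
    n = X.shape[0]

    # Forward weights and biases (these are trained).
    W1 = rng.standard_normal((4, hidden)) / np.sqrt(4)
    W2 = rng.standard_normal((hidden, hidden)) / np.sqrt(hidden)
    W3 = rng.standard_normal((hidden, 1)) / np.sqrt(hidden)
    b1, b2, b3 = np.zeros(hidden), np.zeros(hidden), np.zeros(1)

    # Fixed random feedback matrices (never trained). They replace the
    # transposed forward weights that backpropagation would use.
    B1 = rng.standard_normal((1, hidden)) * 0.5
    B2 = rng.standard_normal((1, hidden)) * 0.5

    losses = []
    for _ in range(steps):
        # Forward pass.
        h1 = np.tanh(X @ W1 + b1)
        h2 = np.tanh(h1 @ W2 + b2)
        out = h2 @ W3 + b3
        e = out - y  # output error, shape (n, 1)
        losses.append(float(np.mean(e ** 2)))

        # DFA: project the output error directly to each hidden layer,
        # then gate by the local tanh derivative.
        d2 = (e @ B2) * (1.0 - h2 ** 2)
        d1 = (e @ B1) * (1.0 - h1 ** 2)

        # The output layer uses the ordinary gradient; hidden layers use
        # the DFA signals, which could be computed in parallel.
        W3 -= lr * h2.T @ e / n
        b3 -= lr * e.mean(axis=0)
        W2 -= lr * h1.T @ d2 / n
        b2 -= lr * d2.mean(axis=0)
        W1 -= lr * X.T @ d1 / n
        b1 -= lr * d1.mean(axis=0)
    return losses


if __name__ == "__main__":
    history = train_dfa()
    print(f"first loss: {history[0]:.4f}, last loss: {history[-1]:.4f}")
```

Because `d1` and `d2` depend only on the output error and on each layer's own activations, neither update has to wait for the other, which is the parallelization property the abstract alludes to.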
This chapter explores the creators and potential consumers of sex robots.
With Realbotix as our case study, we take a closer look at the language and sentiments of those developing the technology and those who are testing, consuming, or showing an interest in it. We do this by means of website and chat forum analysis, and via interviews with those involved.
From this, we can see the motivation for developing a sexual companion robot places the emphasis firmly on the companionship aspect, and that those involved in creating and consuming the products share an ideology of intimacy and affection, with sexual gratification only playing a minor role.
In this paper, we present GrokNet, a deployed image recognition system for commerce applications. GrokNet leverages a multi-task learning approach to train a single computer vision trunk. We achieve a 2.1× improvement in exact product match accuracy when compared to the previous state-of-the-art Facebook product recognition system. We achieve this by training on 7 datasets across several commerce verticals, using 80 categorical loss functions and 3 embedding losses. We share our experience of combining diverse sources with wide-ranging label semantics and image statistics, including learning from human annotations, user-generated tags, and noisy search engine interaction data. GrokNet has demonstrated gains in production applications and operates at Facebook scale.
The global burden of diabetes is rapidly increasing, from 451 million people in 2019 to 693 million by 2045. The insidious onset of type 2 diabetes delays diagnosis and increases morbidity. Given the multifactorial vascular effects of diabetes, we hypothesized that smartphone-based photoplethysmography could provide a widely accessible digital biomarker for diabetes. Here we developed a deep neural network (DNN) to detect prevalent diabetes using smartphone-based photoplethysmography from an initial cohort of 53,870 individuals (the ‘primary cohort’), which we then validated in a separate cohort of 7,806 individuals (the ‘contemporary cohort’) and a cohort of 181 prospectively enrolled individuals from three clinics (the ‘clinic cohort’). The DNN achieved an area under the curve for prevalent diabetes of 0.766 in the primary cohort (95% confidence interval: 0.750–0.782; sensitivity 75%, specificity 65%) and 0.740 in the contemporary cohort (95% confidence interval: 0.723–0.758; sensitivity 81%, specificity 54%). When the output of the DNN, called the DNN score, was included in a regression analysis alongside age, gender, race/ethnicity and body mass index, the area under the curve was 0.830 and the DNN score remained independently predictive of diabetes. The performance of the DNN in the clinic cohort was similar to that in other validation datasets. There was a statistically significant and positive association between the continuous DNN score and hemoglobin A1c (p ≤ 0.001) among those with hemoglobin A1c data. These findings demonstrate that smartphone-based photoplethysmography provides a readily attainable, non-invasive digital biomarker of prevalent diabetes.
If the structure of language vocabularies mirrors the structure of natural divisions that are universally perceived, then the meanings of words in different languages should closely align. By contrast, if shared word meanings are a product of shared culture, history and geography, they may differ between languages in substantial but predictable ways. Here, we analysed the semantic neighbourhoods of 1,010 meanings in 41 languages. The most-aligned words were from semantic domains with high internal structure (number, quantity and kinship). Words denoting natural kinds, common actions and artefacts aligned much less well. Languages that are more geographically proximate, more historically related and/or spoken by more-similar cultures had more aligned word meanings. These results provide evidence that the meanings of common words vary in ways that reflect the culture, history and geography of their users.
We compare the impact of hardware advancement and algorithm advancement for SAT solving over the last two decades. In particular, we compare 20-year-old SAT-solvers on new computer hardware with modern SAT-solvers on 20-year-old hardware. Our findings show that the progress on the algorithmic side has at least as much impact as the progress on the hardware side.
Deep neural networks have been revolutionizing the field of machine learning for the past several years. They have been applied with great success in many domains of the biomedical data sciences and are outperforming extant methods by a large margin. The ability of deep neural networks to pick up local image features and model the interactions between them makes them highly applicable to regulatory genomics. Instead of an image, the networks analyze DNA and RNA sequences and additional epigenomic data. In this review, we survey the successes of deep learning in the field of regulatory genomics. We first describe the fundamental building blocks of deep neural networks, popular architectures used in regulatory genomics, and their training process on molecular sequence data. We then review several key methods in different gene regulation domains. We start with the pioneering method DeepBind and its successors, which were developed to predict protein–DNA binding. We then review methods developed to predict and model epigenetic information, such as histone marks and nucleosome occupancy. Following epigenomics, we review methods to predict protein–RNA binding with its unique challenge of incorporating RNA structure information. Finally, we provide our overall view of the strengths and weaknesses of deep neural networks and prospects for future developments.
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
The inferotemporal (IT) cortex is responsible for object recognition, but it is unclear how the representation of visual objects is organized in this part of the brain. Areas that are selective for categories such as faces, bodies, and scenes have been found, but large parts of IT cortex lack any known specialization, raising the question of what general principle governs IT organization. Here we used functional MRI, microstimulation, electrophysiology, and deep networks to investigate the organization of the macaque IT cortex. We built a low-dimensional object space to describe general objects using a feedforward deep neural network trained on object classification. Responses of IT cells to a large set of objects revealed that single IT cells project incoming objects onto specific axes of this space. Anatomically, cells were clustered into four networks according to the first two components of their preferred axes, forming a map of object space. This map was repeated across three hierarchical stages of increasing view invariance, and cells that comprised these maps collectively harboured sufficient coding capacity to approximately reconstruct objects. These results provide a unified picture of IT organization in which category-selective regions are part of a coarse map of object space whose dimensions can be extracted from a deep network.
Companies use about 300,000 times more computation training the best AI systems today than they did in 2012 and algorithmic innovations have also made them 25 times more efficient at the same tasks.
These are the headline results of two recent papers—“AI and Compute” and “AI and Efficiency”—from the Foresight Team at OpenAI. In today’s episode I spoke with one of the authors, Danny Hernandez, who joined OpenAI after helping develop better forecasting methods at Twitch and Open Philanthropy. Danny and I talk about how to understand his team’s results and what they mean (and don’t mean) for how we should think about progress in AI going forward.
Debates around the future of AI can sometimes be pretty abstract and theoretical. Danny hopes that providing rigorous measurements of some of the inputs to AI progress so far can help us better understand what causes that progress, as well as ground debates about the future of AI in a better shared understanding of the field… In the interview, Danny and I also discuss a range of other topics, including:
The question of which experts to believe
Danny’s journey to working at OpenAI
The usefulness of “decision boundaries”
The importance of Moore’s law for people who care about the long-term future
What OpenAI’s Foresight Team’s findings might imply for policy
The question of whether progress in the performance of AI systems is linear
The safety teams at OpenAI and who they’re looking to hire
One idea for finding someone to guide your learning
The importance of hardware expertise for making a positive impact
If you believe AI progress is fast, what would progress look like that would convince you it’s slow? Paint a picture of that five years from now. What does slow progress look like to you? And now you’re like, “Oh yeah, progress is actually slow”. And what could have happened that would convince you that it’s actually fast. But you can make what would update you clear to yourself and others and that for big decisions, this is generally worthwhile.
Three factors drive the advance of AI: algorithmic innovation, data, and the amount of compute available for training. Algorithmic progress has traditionally been more difficult to quantify than compute and data. In this work, we argue that algorithmic progress has an aspect that is both straightforward to measure and interesting: reductions over time in the compute needed to reach past capabilities. We show that the number of floating-point operations required to train a classifier to AlexNet-level performance on ImageNet has decreased by a factor of 44× between 2012 and 2019. This corresponds to algorithmic efficiency doubling every 16 months over a period of 7 years. By contrast, Moore’s Law would only have yielded an 11× cost improvement. We observe that hardware and algorithmic efficiency gains multiply and can be on a similar scale over meaningful horizons, which suggests that a good model of AI progress should integrate measures from both.
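The relationship between an overall efficiency gain and a doubling time can be checked with a few lines of arithmetic. The sketch below is a naive endpoint calculation, not the paper's method: the function names are hypothetical, the 7-year span is taken as exactly 84 months, and Moore's Law is assumed to mean a doubling every 24 months.

```python
import math


def doubling_time_months(improvement_factor, elapsed_months):
    """Months per doubling implied by a total gain over a given period."""
    doublings = math.log2(improvement_factor)
    return elapsed_months / doublings


def gain_from_doubling(elapsed_months, doubling_months=24):
    """Total gain after a period, given a fixed doubling time.

    The 24-month default is an assumed reading of Moore's Law.
    """
    return 2 ** (elapsed_months / doubling_months)


if __name__ == "__main__":
    months = 7 * 12  # 2012 to 2019, treated as exactly 84 months
    print(f"44x over {months} months: doubling every "
          f"{doubling_time_months(44, months):.1f} months")
    print(f"24-month doubling over {months} months: "
          f"{gain_from_doubling(months):.1f}x")
```

Under these assumptions, a 44× gain over 84 months works out to a doubling roughly every 15.4 months, close to but not exactly the quoted 16 months; the gap presumably reflects the exact dates or fitting procedure used, which this sketch does not reproduce. A 24-month doubling over the same period gives about 11.3×, in line with the 11× figure quoted for Moore's Law.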
Compared to 2012, it now takes 44 times less compute to train a neural network to the level of AlexNet (by contrast, Moore’s Law would yield an 11× cost improvement over this period). Our results suggest that for AI tasks with high levels of recent investment, algorithmic progress has yielded more gains than classical hardware efficiency.
…For our analysis, we primarily leveraged open-source re-implementations to measure progress on AlexNet level performance over a long horizon. We saw a similar rate of training efficiency improvement for ResNet-50 level performance on ImageNet (17-month doubling time). We saw faster rates of improvement over shorter timescales in Translation, Go, and Dota 2:
Within translation, the Transformer surpassed seq2seq performance on English to French translation on WMT’14 with 61× less training compute 3 years later.
We estimate AlphaZero took 8× less compute to get to AlphaGo Zero level performance 1 year later.
OpenAI Five Rerun required 5× less training compute to surpass OpenAI Five (which beat the world champions, OG) 3 months later.
It can be helpful to think of compute in 2012 not being equal to compute in 2019 in a similar way that dollars need to be inflation-adjusted over time. A fixed amount of compute could accomplish more in 2019 than in 2012. One way to think about this is that some types of AI research progress in two stages, similar to the “tick tock” model of development seen in semiconductors; new capabilities (the “tick”) typically require a substantial amount of compute expenditure to obtain, then refined versions of those capabilities (the “tock”) become much more efficient to deploy due to process improvements. Increases in algorithmic efficiency allow researchers to do more experiments of interest in a given amount of time and money. In addition to being a measure of overall progress, algorithmic efficiency gains speed up future AI research in a way that’s somewhat analogous to having more compute.
…We also find increases in inference efficiency in terms of GPU time, parameters, and flops meaningful, but mostly as a result of their economic implications [Inference costs dominate total costs for successful deployed systems. Inference costs scale with usage of the system, whereas training costs only need to be paid once.] rather than their effect on future research progress. ShuffleNet achieved AlexNet-level performance with an 18× inference efficiency increase in 5 years (15-month doubling time), which suggests that training efficiency and inference efficiency might improve at similar rates.
…For all these reasons, we’re going to start tracking efficiency SOTAs publicly. We’ll start with vision and translation efficiency benchmarks (ImageNet and WMT14), and we’ll consider adding more benchmarks over time. We believe there are efficiency SOTAs on these benchmarks we’re unaware of and encourage the research community to submit them here (we’ll give credit to original authors and collaborators).
During learning, the brain modifies synapses to improve behaviour. In the cortex, synapses are embedded within multilayered networks, making it difficult to determine the effect of an individual synaptic modification on the behaviour of the system. The backpropagation algorithm solves this problem in deep artificial neural networks, but historically it has been viewed as biologically problematic. Nonetheless, recent developments in neuroscience and the successes of artificial neural networks have reinvigorated interest in whether backpropagation offers insights for understanding learning in the cortex. The backpropagation algorithm learns quickly by computing synaptic updates using feedback connections to deliver error signals. Although feedback connections are ubiquitous in the cortex, it is difficult to see how they could deliver the error signals required by strict formulations of backpropagation. Here we build on past and recent developments to argue that feedback connections may instead induce neural activities whose differences can be used to locally approximate these signals and hence drive effective learning in deep networks in the brain.
Human sketches can be expressive and abstract at the same time. Generating anime avatars from simple or even poorly drawn face sketches is an interesting problem. Much related work exists, such as auto-coloring sketches into anime or transforming real photos into anime. However, few works yet show how to generate anime avatars from simple drawn input alone. In this project, we propose using a GAN to generate anime avatars from sketches.
Evolution is a blind fitting process by which organisms become adapted to their environment. Does the brain use similar brute-force fitting processes to learn how to perceive and act upon the world? Recent advances in artificial neural networks have exposed the power of optimizing millions of synaptic weights over millions of observations to operate robustly in real-world contexts. These models do not learn simple, human-interpretable rules or representations of the world; rather, they use local computations to interpolate over task-relevant manifolds in a high-dimensional parameter space. Counterintuitively, similar to evolutionary processes, over-parameterized models can be simple and parsimonious, as they provide a versatile, robust solution for learning a diverse set of functions. This new family of direct-fit models presents a radical challenge to many of the theoretical assumptions in psychology and neuroscience. At the same time, this shift in perspective establishes unexpected links with developmental and ecological psychology.
Obtaining venous access for blood sampling or intravenous (IV) fluid delivery is an essential first step in patient care. However, success rates rely heavily on clinician experience and patient physiology. Difficulties in obtaining venous access result in missed sticks and injury to patients, and typically require alternative access pathways and additional personnel that lengthen procedure times, thereby creating unnecessary costs to healthcare facilities.
Here, we present the first-in-human assessment of an automated robotic venipuncture device designed to safely perform blood draws on peripheral forearm veins. The device combines ultrasound imaging and miniaturized robotics to identify suitable vessels for cannulation and robotically guide an attached needle toward the lumen center. The device demonstrated results comparable to or exceeding those of clinical standards, with a success rate of 87% on all participants (n = 31), a 97% success rate on non-difficult venous access participants (n = 25), and an average procedure time of 93 ± 30 s (n = 31).
In the future, this device can be extended to other areas of vascular access such as IV catheterization, central venous access, dialysis, and arterial line placement.
2019-richards.pdf: “A deep learning framework for neuroscience”, Blake A. Richards, Timothy P. Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy Berker, Surya Ganguli, Colleen J. Gillon, Danijar Hafner, Adam Kepecs, Nikolaus Kriegeskorte, Peter Latham, Grace W. Lindsay, Kenneth D. Miller, Richard Naud, Christopher C. Pack, Panayiota Poirazi, Pieter Roelfsema, João Sacramento, Andrew Saxe, Benjamin Scellier, Anna C. Schapiro, Walter Senn, Greg Wayne, Daniel Yamins, Friedemann Zenke, Joel Zylberberg, Denis Therien, Konrad P. Kording
Artificial intelligence (AI) is surpassing human performance in a growing number of domains. However, there is limited evidence of its economic effects. Using data from a digital platform, we study a key application of AI: machine translation. We find that the introduction of a new machine translation system has substantially increased international trade on this platform, increasing exports by 10.9%. Furthermore, heterogeneous treatment effects are consistent with a substantial reduction in translation costs. Our results provide causal evidence that language barriers substantially hinder trade and that AI has already begun to improve economic efficiency in at least one domain.
The idea that the brain learns generative models of the world has been widely promulgated. Most approaches have assumed that the brain learns an explicit density model that assigns a probability to each possible state of the world. However, explicit density models are difficult to learn, requiring approximate inference techniques that may find poor solutions. An alternative approach is to learn an implicit density model that can sample from the generative model without evaluating the probabilities of those samples. The implicit model can be trained to fool a discriminator into believing that the samples are real. This is the idea behind generative adversarial algorithms, which have proven adept at learning realistic generative models. This paper develops an adversarial framework for probabilistic computation in the brain. It first considers how generative adversarial algorithms overcome some of the problems that vex prior theories based on explicit density models. It then discusses the psychological and neural evidence for this framework, as well as how the breakdown of the generator and discriminator could lead to delusions observed in some mental disorders.
…Our sensory inputs are impoverished, and yet our experience of the world feels richly detailed. For example, our fovea permits us access to a high fidelity region of the visual field only twice the size of our thumbnail held at arm’s length. But we don’t experience the world as though looking through a tiny aperture. Instead, our brains feed us a “grand illusion” of panoptic vision (Chater, 2018; Noe et al 2000; Odegaard et al 2018). Similarly, we receive no visual input in the region of the retina that connects to the optic nerve, yet under normal circumstances we are unaware of this blind spot. Moreover, even when we receive high fidelity visual input, we may still fail to witness dramatic changes in scenes (Simons, 2000), as though our brains have contrived imaginary scenes that displace the true scenes.
…First, how can we explain the phenomenology of illusion: why do some illusions feel real, as though one is actually seeing them, whereas other inferences carry information content without the same perceptual experience. For example, Ramachandran and Hirstein (1997) use the example of gazing at wallpaper in a bathroom, where the wallpaper in your visual periphery is ‘filled in’ (you subjectively experience it as high fidelity even though objectively you perceive it with low fidelity), but the wallpaper behind your head is not filled in. In other words, you infer that the wallpaper continues behind your head, and you may even know this with high confidence, but you do not have the experience of seeing the wallpaper behind your head. Thus, the vividness or “realness” of perceptual experience is not a simple function of belief strength. So what is it a function of? Second, how can we explain the peculiar ways that the inferential apparatus breaks down? In particular, how can we understand the origins of delusions, hallucinations, and confabulations that arise in certain mental disorders? While Bayesian models have been developed to explain these phenomena, they fall short in certain ways that we discuss later on.
Recently, deep convolutional neural networks (CNNs) have been widely explored in single image super-resolution (SISR) and obtained remarkable performance. However, most of the existing CNN-based SISR methods mainly focus on wider or deeper architecture design, neglecting to explore the feature correlations of intermediate layers, hence hindering the representational power of CNNs. To address this issue, in this paper, we propose a second-order attention network (SAN) for more powerful feature expression and feature correlation learning. Specifically, a novel trainable second-order channel attention (SOCA) module is developed to adaptively rescale the channel-wise features by using second-order feature statistics for more discriminative representations. Furthermore, we present a non-locally enhanced residual group (NLRG) structure, which not only incorporates non-local operations to capture long-distance spatial contextual information, but also contains repeated local-source residual attention groups (LSRAG) to learn increasingly abstract feature representations. Experimental results demonstrate the superiority of our SAN network over state-of-the-art SISR methods in terms of both quantitative metrics and visual quality.
Technology that translates neural activity into speech would be transformative for people who are unable to communicate as a result of neurological impairments. Decoding speech from neural activity is challenging because speaking requires very precise and rapid multi-dimensional control of vocal tract articulators.
Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement, and then transformed these representations into speech acoustics. In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance even with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferable across participants. Furthermore, the decoder could synthesize speech when a participant silently mimed sentences.
These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.
We propose an efficient algorithm to embed a given image into the latent space of StyleGAN. This embedding enables semantic image editing operations that can be applied to existing photographs. Taking the StyleGAN trained on the FFHQ dataset as an example, we show results for image morphing, style transfer, and expression transfer. Studying the results of the embedding algorithm provides valuable insights into the structure of the StyleGAN latent space. We propose a set of experiments to test what class of images can be embedded, how they are embedded, what latent space is suitable for embedding, and if the embedding is semantically meaningful.
…Going beyond faces, interestingly, we find that although the FFHQ StyleGAN generator is trained on a human face dataset, the embedding algorithm is capable of going far beyond human faces. As Figure 1 shows, although slightly worse than those of human faces, we can obtain reasonable and relatively high-quality embeddings of cats, dogs and even paintings and cars. This reveals the effective embedding capability of the algorithm and the generality of the learned filters of the generator.
The use of Artificial Intelligence (AI) machines using deep learning neural networks to create material that facially looks like it should be protected by copyright is growing exponentially. From articles in national news media to music, film, poetry and painting, AI machines create material that has economic value and that competes with productions of human authors. The Article reviews both normative and doctrinal arguments for and against the protection by copyright of literary and artistic productions made by AI machines.
The Article finds that the arguments in favor of protection are flawed and unconvincing and that a proper analysis of the history, purpose, and major doctrines of copyright law all lead to the conclusion that productions that do not result from human creative choices belong to the public domain.
The Article proposes a test to determine which productions should be protected, including in case of collaboration between human and machine. Finally, the Article applies the proposed test to three specific fact patterns to illustrate its application.
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets.
We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset—matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples.
The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text.
These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Existing personal assistants and agents are by design limited in their ability to form or encourage close personal bonds.
The Harmony system is designed to be a customizable personal companion agent capable of close personal interaction via the user’s phone, virtual reality headset, as well as through a physical interactive android body. In this chapter, we will describe the history that led to Harmony’s creation, the unique challenges and the overall system design.
We will also look at user reactions to the system and anticipated future developments.
Particular deep artificial neural networks (ANNs) are today’s most accurate models of the primate brain’s ventral visual stream. Using an ANN-driven image synthesis method, we found that luminous power patterns (i.e., images) can be applied to primate retinae to predictably push the spiking activity of targeted V4 neural sites beyond naturally occurring levels. This method, although not yet perfect, achieves unprecedented independent control of the activity state of entire populations of V4 neural sites, even those with overlapping receptive fields. These results show how the knowledge embedded in today’s ANN models might be used to noninvasively set desired internal brain states at neuron-level resolution, and suggest that more accurate ANN models would produce even more accurate control.
We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images [3.3m training] than the MS-COCO dataset (Lin et al 2014) and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages.
We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNetv2 (Szegedy et al 2016) for image-feature extraction and Transformer (Vaswani et al 2017) for sequence modeling achieves the best performance when trained on the Conceptual Captions dataset.
We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction. We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles. When given reference documents, we show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.
2018-davies.pdf: “Loihi: A Neuromorphic Manycore Processor with On–Chip Learning”, Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, Yuyun Liao, Chit-Kwan Lin, Andrew Lines, Ruokun Liu, Deepak Mathaikutty, Steven McCoy, Arnab Paul, Jonathan Tse, Guruguhanathan Venkataramanan, Yi-Hsin Weng, Andreas Wild, Yoonseok Yang, Hong Wang (2018-01-16):
Loihi is a 60-mm² chip fabricated in Intel’s 14-nm process that advances the state-of-the-art modeling of spiking neural networks in silicon. It integrates a wide range of novel features for the field, such as hierarchical connectivity, dendritic compartments, synaptic delays, and, most importantly, programmable synaptic learning rules. Running a spiking convolutional form of the Locally Competitive Algorithm, Loihi can solve LASSO optimization problems with over 3 orders of magnitude superior energy-delay-product compared to conventional solvers running on a CPU iso-process/voltage/area. This provides an unambiguous example of spike-based computation, outperforming all known conventional solutions.
…Spiking Neural Networks: We consider an SNN a model of computation with neurons as the basic processing elements. Different from ANNs, SNNs incorporate time as an explicit dependency in their computations. At some instant in time, one or more neurons might send out single-bit impulses, the spike, to neighbors through directed connections known as synapses, with a potentially non-zero traveling time. Neurons have local state variables with rules governing their evolution and timing of spike generation. Hence, the network is a dynamical system where individual neurons interact through spikes
…Chip Overview: Loihi features a many-core mesh comprising 128 neuromorphic cores, 3 embedded x86 processor cores, and off-chip communication interfaces that hierarchically extend the mesh in 4 planar directions to other chips. An asynchronous network-on-chip (NoC) transports all communication between cores in the form of packetized messages. The NoC supports write, read request, and read response messages for core management and x86-to-x86 messaging, spike messages for SNN computation, and barrier messages for time synchronization between cores. All message types may be sourced externally by a host CPU or on-chip by the x86 cores, and these may be directed to any on-chip core. Messages may be hierarchically encapsulated for off-chip communication over a second-level network. The mesh protocol supports scaling to 4096 on-chip cores and, through hierarchical addressing, up to 16,384 chips.
Each neuromorphic core implements 1,024 primitive spiking neural units (compartments) grouped into sets of trees constituting neurons. The compartments, along with their fan-in and fan-out connectivity, share configuration and state variables in 10 architectural memories. Their state variables are updated in a time-multiplexed, pipelined manner every algorithmic time-step. When a neuron’s activation exceeds some threshold level, it generates a spike message that is routed to a set of fan-out compartments contained in some number of destination cores.
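The threshold-and-spike behavior described above can be illustrated with a generic leaky integrate-and-fire neuron. This is only a minimal sketch: the function name `simulate_lif`, the `decay` and `threshold` parameters, and the reset-to-zero rule are assumptions made for illustration, not Loihi's actual compartment model or programming interface.

```python
def simulate_lif(input_current, threshold=1.0, decay=0.9):
    """Simulate one leaky integrate-and-fire neuron over discrete time-steps.

    Returns a list of 0/1 values, one per time-step (1 = the neuron spiked).
    """
    potential = 0.0  # local state variable of the neuron
    spikes = []
    for current in input_current:
        # Leak part of the stored potential, then integrate the new input.
        potential = decay * potential + current
        if potential >= threshold:
            spikes.append(1)   # emit a single-bit spike
            potential = 0.0    # reset after firing (illustrative choice)
        else:
            spikes.append(0)
    return spikes


# Three weak inputs accumulate into one spike; a strong input spikes at once.
spikes = simulate_lif([0.5, 0.5, 0.5, 0.0, 0.0, 1.2])
print(spikes)
```

Because the state carries over between time-steps, the output depends on when inputs arrive and not only on their sum, which is the sense in which time is an explicit dependency in such models.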
Traditionally, medical discoveries are made by observing associations, making hypotheses from them and then designing and running experiments to test the hypotheses. However, with medical images, observing and quantifying associations can often be difficult because of the wide variety of features, patterns, colours, values and shapes that are present in real data. Here, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on 2 independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) =0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction. [Sex detection replicated in Korot et al 2021.]
2018-defauw.pdf: “Clinically applicable deep learning for diagnosis and referral in retinal disease”, Jeffrey Fauw, Joseph R. Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, Sam Blackwell, Harry Askham, Xavier Glorot, Brendan O’Donoghue, Daniel Visentin, George Driessche, Balaji Lakshminarayanan, Clemens Meyer, Faith Mackinder, Simon Bouton, Kareem Ayoub, Reena Chopra, Dominic King, Alan Karthikesalingam, Cían O. Hughes, Rosalind Raine, Julian Hughes, Dawn A. Sim, Catherine Egan, Adnan Tufail, Hugh Montgomery, Demis Hassabis, Geraint Rees, Trevor Back, Peng T. Khaw, Mustafa Suleyman, Julien Cornebise, Pearse A. Keane, Olaf Ronneberger
Language models (LMs) have improved dramatically in the past years due to the wide application of neural networks. This raises the question of how far we are from the perfect language model and how much more research is needed in language modelling. As for perplexity, giving a value for human perplexity (as an upper bound of what is reasonably expected from an LM) is difficult. Word error rate (WER) has the disadvantage that it also measures the quality of other components of a speech recognizer like the acoustic model and the feature extraction. We therefore suggest evaluating LMs in a generative setting (which has been done before on selected hand-picked examples) and running a human evaluation on the generated sentences. The results imply that LMs need about 10 to 20 more years of research before human performance is reached. Moreover, we show that the human judgement scores on the generated sentences and perplexity are closely correlated. This leads to an estimated perplexity of 12 for an LM that would be able to pass the human judgement test in the setting we suggested.
[Keywords: language model, generative task, human judgement score, performance gap]
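For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability a model assigns to each token. The sketch below uses a hypothetical helper name, `perplexity`, and made-up probabilities. It shows that a model assigning every token probability 1/12 has perplexity 12, so the estimate above corresponds to being, on average, as uncertain as a uniform choice among 12 words.

```python
import math


def perplexity(token_probabilities):
    """Perplexity = exp of the mean negative log-probability per token."""
    log_sum = sum(math.log(p) for p in token_probabilities)
    return math.exp(-log_sum / len(token_probabilities))


# A model that spreads its belief uniformly over 12 candidates at every step.
uniform = perplexity([1 / 12] * 20)

# A model that is more confident on most tokens scores a lower perplexity.
confident = perplexity([0.5, 0.5, 0.25, 0.5])
print(uniform, confident)
```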
YouTube represents one of the largest scale and most sophisticated industrial recommendation systems in existence. In this paper, we describe the system at a high level and focus on the dramatic performance improvements brought by deep learning. The paper is split according to the classic two-stage information retrieval dichotomy: first, we detail a deep candidate generation model and then describe a separate deep ranking model [since upgraded to REINFORCE]. We also provide practical lessons and insights derived from designing, iterating and maintaining a massive recommendation system with enormous user-facing impact.
[Keywords: recommender system, deep learning, scalability]
Nonhuman autonomous systems are not legal persons under current law. The history of organizational law, however, demonstrates that agreements can, with increasing degrees of autonomy, direct the actions of legal persons. Agreements are isomorphic with algorithms; that is, a legally enforceable agreement can give legal effect to the arbitrary discernible states of an algorithm or other process. As a result, autonomous systems may end up being able, at least, to emulate many of the private-law rights of legal persons. This essay demonstrates a technique by which this is possible by means of limited liability companies (LLCs), a very flexible modern type of business organization. The techniques that this essay describes are not just futuristic possibilities; as this essay argues, they are already possible under current law.
Yahoo’s recently open sourced neural network, open_nsfw, is a fine tuned Residual Network which scores images on a scale of 0 to 1 on its suitability for use in the workplace…What makes an image NSFW, according to Yahoo? I explore this question with a clever new visualization technique by Nguyen et al…Like Google’s Deep Dream, this visualization trick works by maximally activating certain neurons of the classifier. Unlike deep dream, we optimize these activations by performing descent on a parameterization of the manifold of natural images.
[Demonstration of an unusual use of backpropagation to ‘optimize’ a neural network: instead of taking a piece of data to input to a neural network and then updating the neural network to change its output slightly towards some desired output (such as a correct classification), one can instead update the input so as to make the neural net output slightly more towards the desired output. When using an image classification neural network, this reversed form of optimization will ‘hallucinate’ or ‘edit’ the ‘input’ to make it more like a particular class of images. In this case, a porn/NSFW-detecting NN is reversed so as to make images more (or less) “porn-like”. Goh runs this process on various images like landscapes, musical bands, or empty images; the maximally/minimally porn-like images are disturbing, hilarious, and undeniably pornographic in some sense.]
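The idea of updating the input rather than the weights can be sketched on a toy model. The example below is an assumption-laden illustration, not the method used in the post: it replaces the deep classifier with a single logistic unit, uses hand-picked weights, and omits the natural-image prior of Nguyen et al. The names `score` and `optimize_input` are hypothetical.

```python
import math


def score(weights, bias, x):
    """Classifier output in (0, 1) for input vector x (weights are frozen)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def optimize_input(weights, bias, x, steps=50, lr=0.5):
    """Gradient ascent on the input so that the frozen classifier scores it higher."""
    x = list(x)
    for _ in range(steps):
        s = score(weights, bias, x)
        # d/dx log(sigmoid(w.x + b)) = (1 - s) * w; only x is updated.
        x = [xi + lr * (1.0 - s) * w for xi, w in zip(x, weights)]
    return x


weights, bias = [2.0, -1.0, 0.5], -1.0
start = [0.0, 0.0, 0.0]
edited = optimize_input(weights, bias, start)
print(score(weights, bias, start), score(weights, bias, edited))
```

Flipping the sign of the update would instead push the input toward the lowest-scoring region, which corresponds to the "less" direction mentioned in the note.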
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer.
These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
The automatic transcription of text in handwritten documents has many applications, from automatic document processing, to indexing and document understanding.
One of the most popular approaches nowadays consists in scanning the text line image with a sliding window, from which features are extracted, and modeled by Hidden Markov Models (HMMs). Associated with neural networks, such as Multi-Layer Perceptrons (MLPs) or Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs), and with a language model, these models yield good transcriptions. On the other hand, in many machine learning applications, including speech recognition and computer vision, deep neural networks consisting of several hidden layers recently produced a large reduction of error rates.
In this thesis, we have conducted a thorough study of different aspects of optical models based on deep neural networks in the hybrid neural network / HMM scheme, in order to better understand and evaluate their relative importance.
First, we show that deep neural networks produce consistent and large improvements over networks with one or 2 hidden layers, independently of the kind of neural network, MLP or RNN, and of input, handcrafted features or pixels.
Then, we show that deep neural networks with pixel inputs compete with those using handcrafted features, and that depth plays an important role in the reduction of the performance gap between the 2 kinds of inputs, supporting the idea that deep neural networks effectively build hierarchical and relevant representations of their inputs, and that features are automatically learnt on the way.
Despite the dominance of LSTM-RNNs in the recent literature of handwriting recognition, we show that deep MLPs achieve comparable results. Moreover, we evaluated different training criteria. With sequence-discriminative training, we report similar improvements for MLP/HMMs as those observed in speech recognition.
We also show how the Connectionist Temporal Classification framework is especially suited to RNNs.
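As background on Connectionist Temporal Classification, the sketch below shows only its output convention: the network emits one label per frame, including a special blank, and a transcription is read off by merging repeated labels and then dropping blanks. The blank symbol `-` and the function name `ctc_collapse` are illustrative assumptions. The thesis's actual training and decoding pipeline is not reproduced here.

```python
from itertools import groupby

BLANK = "-"  # hypothetical symbol standing in for the CTC blank label


def ctc_collapse(frame_labels, blank=BLANK):
    """Map a per-frame label sequence to a transcription.

    Step 1: merge consecutive repeats. Step 2: remove blanks.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# The blank between the two 'l' runs keeps the double letter in "hello".
decoded = ctc_collapse("hh-e-ll-lo--")
print(decoded)
```

Because many frame-level sequences collapse to the same transcription, no frame-by-frame alignment between image columns and characters has to be supplied during training.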
Finally, the novel dropout technique to regularize neural networks was recently applied to LSTM-RNNs. We tested its effect at different positions in LSTM-RNNs, thus extending previous works, and we show that its relative position to the recurrent connections is important.
We conducted the experiments on 3 public databases, representing 2 languages (English and French) and 2 epochs, using different kinds of neural network inputs: handcrafted features and pixels. We validated our approach by taking part in the HTRtS contest in 2014.
The results of the final systems presented in this thesis, namely MLPs and RNNs, with handcrafted feature or pixel inputs, are comparable to the state-of-the-art on Rimes and IAM. Moreover, the combination of these systems outperformed all published results on the considered databases.
I draw the reader’s attention to machine teaching, the problem of finding an optimal training set given a machine learning algorithm and a target model. In addition to generating fascinating mathematical questions for computer scientists to ponder, machine teaching holds the promise of enhancing education and personnel training. The Socratic dialogue style aims to stimulate critical thinking.
Natural language processing (NLP) is a theory-motivated range of computational techniques for the automatic analysis and representation of human language. NLP research has evolved from the era of punch cards and batch processing (in which the analysis of a sentence could take up to 7 minutes) to the era of Google and the likes of it (in which millions of webpages can be processed in less than a second). This review paper draws on recent developments in NLP research to look at the past, present, and future of NLP technology in a new light. Borrowing the paradigm of ‘jumping curves’ from the field of business management and marketing prediction, this survey article reinterprets the evolution of NLP research as the intersection of three overlapping curves (namely the Syntactics, Semantics, and Pragmatics Curves), which will eventually lead NLP research to evolve into natural language understanding.
We examine evidence of progress in 6 areas of algorithms research [SAT, chess+Go, factoring, physics simulations, linear programming+scheduling, machine learning], with an eye to understanding likely algorithmic trajectories after the advent of artificial general intelligence. Many of these areas appear to experience fast improvement, though the data are often noisy. For tasks in these areas, gains from algorithmic progress have been roughly 50 to 100% as large as those from hardware progress. Improvements tend to be incremental, forming a relatively smooth curve on the scale of years.
I. J. Good’s thesis of the “intelligence explosion” states that a sufficiently advanced machine intelligence could build a smarter version of itself, which could in turn build an even smarter version, and that this process could continue to the point of vastly exceeding human intelligence. As Sandberg 2010 correctly notes, there have been several attempts to lay down return on investment formulas intended to represent sharp speedups in economic or technological growth, but very little attempt has been made to deal formally with Good’s intelligence explosion thesis as such.
I identify the key issue as returns on cognitive reinvestment—the ability to invest more computing power, faster computers, or improved cognitive algorithms to yield cognitive labor which produces larger brains, faster brains, or better mind designs. There are many phenomena in the world which have been argued to be evidentially relevant to this question, from the observed course of hominid evolution, to Moore’s Law, to the competence over time of machine chess-playing systems, and many more. I go into some depth on some debates which then arise on how to interpret such evidence. I propose that the next step in analyzing positions on the intelligence explosion would be to formalize return on investment curves, so that each stance can formally state which possible micro-foundations they hold to be falsified by historical observations. More generally I pose multiple open questions of “returns on cognitive reinvestment” or “intelligence explosion microeconomics.” Although such questions have received little attention thus far, they seem highly relevant to policy choices affecting outcomes for Earth-originating intelligent life.
Reduction to SAT is a very successful approach to solving hard combinatorial problems in Artificial Intelligence and computer science in general. Most commonly, problem instances reduced to SAT are solved with a general-purpose SAT solver. Although there is the obvious possibility of improving the SAT solving process with application-specific heuristics, this has rarely been done successfully.
In this work we propose a planning-specific variable selection strategy for SAT solving. The strategy is based on generic principles about properties of plans, and its performance with standard planning benchmarks often substantially improves on generic variable selection heuristics, such as VSIDS, and often lifts it to the same level with other search methods such as explicit state-space search with heuristic search algorithms.
Over the years, the competitions have substantially contributed to the fast progress in SAT solver technology that has made SAT a practical success story of computer science. This short article provides an overview of the SAT solver competitions.
The competitive MNIST handwritten digit recognition benchmark has a long history of broken records since 1998. The most recent advancement by others dates back 8 years (error rate 0.4%).
Good old on-line backpropagation for plain multi-layer perceptrons yields a very low 0.35% error rate on the MNIST handwritten digits benchmark with a single MLP, and 0.31% with a committee of 7 MLPs.
All we need to achieve this until-2011-best-result are many hidden layers, many neurons per layer, numerous deformed training images to avoid overfitting, and graphics cards to greatly speed up learning.
[Keywords: neural network, multilayer perceptron, GPU, training set deformations, MNIST, committee, backpropagation]
Note: This work combines 3 previously published papers [1,2,3].
…In recent decades the amount of raw computing power per Euro has grown by a factor of 100–1000 per decade. Our results show that this ongoing hardware progress may be more important than advances in algorithms and software (although the future will belong to methods combining the best of both worlds). Current graphics cards (GPUs) are already more than 50× faster than standard microprocessors when it comes to training big and deep neural networks by the ancient algorithm, online backpropagation (weight update rate up to 7.5×10⁹/s, and more than 10¹⁵ per trained network). On the competitive MNIST handwriting benchmark, single precision floating-point GPU-based neural nets surpass all previously reported results, including those obtained by much more complex methods involving specialized architectures, unsupervised pre-training, combinations of machine learning classifiers etc. Training sets of sufficient size to avoid overfitting are obtained by appropriately deforming images.
Of course, the approach is not limited to handwriting, and obviously holds great promise for many visual and other pattern recognition problems.
At Brown University, there was excitement about having access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it’s taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It’s not annotated with carefully hand-corrected part-of-speech tags. But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus, along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions, captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks, if only we knew how to extract the model from the data.
…For many tasks, words and word combinations provide all the representational machinery we need to learn from text.
…So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.
The Asirra CAPTCHA [EDHS2007], proposed at ACM CCS 2007, relies on the problem of distinguishing images of cats and dogs (a task that humans are very good at). The security of Asirra is based on the presumed difficulty of classifying these images automatically.
In this paper, we describe a classifier which is 82.7% accurate in telling apart the images of cats and dogs used in Asirra. This classifier is a combination of support-vector machine classifiers trained on color and texture features extracted from images. Our classifier allows us to solve a 12-image Asirra challenge automatically with probability 10.3%. This probability of success is statistically-significantly higher than the estimate of 0.2% given in [EDHS2007] for machine vision attacks. Our results suggest caution against deploying Asirra without safeguards.
We also investigate the impact of our attacks on the partial credit and token bucket algorithms proposed in [EDHS2007]. The partial credit algorithm weakens Asirra considerably and we recommend against its use. The token bucket algorithm helps mitigate the impact of our attacks and allows Asirra to be deployed in a way that maintains an appealing balance between usability and security. One contribution of our work is to inform the choice of safeguard parameters in Asirra deployments.
…Our classifier is a combination of 2 support-vector machine (SVM) classifiers trained on color and texture features of images. The classifier is entirely automatic, and requires no manual input other than the one-time labelling of training images. Using 15,760 color features, and 5,000 texture features per image, our classifier is 82.7% accurate. The classifier was trained on a commodity PC, using 13,000 labeled images of cats and dogs downloaded from the Asirra website.
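The 12-image success probability quoted above can be roughly checked by arithmetic, under the assumption (mine, for illustration) that the classifier's errors on the 12 images are independent, so that the per-image accuracy is simply raised to the 12th power. With the rounded 82.7% figure this gives about 10.2%, close to the reported 10.3%; the small gap presumably comes from rounding of the accuracy, which is also an assumption. The function name is hypothetical.

```python
def challenge_success_probability(per_image_accuracy, images_per_challenge=12):
    """Chance of labelling every image in one challenge correctly,
    assuming independent errors across images."""
    return per_image_accuracy ** images_per_challenge


p_classifier = challenge_success_probability(0.827)  # SVM classifier from the paper
p_random = challenge_success_probability(0.5)        # coin-flip guessing, 1/4096
print(round(p_classifier, 4), p_random)
```

The comparison with coin-flip guessing shows why a moderately accurate per-image classifier matters: the success rate falls exponentially in the number of images, so each point of per-image accuracy is amplified twelvefold in the exponent.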
One might imagine that AI systems with harmless goals will be harmless. This paper instead shows that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways.
We identify a number of “drives” that will appear in sufficiently advanced AI systems of any design. We call them drives because they are tendencies which will be present unless explicitly counteracted.
We start by showing that goal-seeking systems will have drives to model their own operation and to improve themselves.
We then show that self-improving systems will be driven to clarify their goals and represent them as economic utility functions. They will also strive for their actions to approximate rational economic behavior. This will lead almost all systems to protect their utility functions from modification and their utility measurement systems from corruption. We also discuss some exceptional systems which will want to modify their utility functions.
We next discuss the drive toward self-protection which causes systems to try to prevent themselves from being harmed. Finally we examine drives toward the acquisition of resources and toward their efficient utilization.
We end with a discussion of how to incorporate these insights in designing intelligent technology which will lead to a positive future for humanity.
We present Asirra (Figure 1), a CAPTCHA that asks users to identify cats out of a set of 12 photographs of both cats and dogs.
Asirra is easy for users; user studies indicate it can be solved by humans 99.6% of the time in under 30 seconds. Barring a major advance in machine vision, we expect computers will have no better than a 1/54,000 chance of solving it. Asirra’s image database is provided by a novel, mutually beneficial partnership with Petfinder.com. In exchange for the use of their 3 million images, we display an “adopt me” link beneath each one, promoting Petfinder’s primary mission of finding homes for homeless animals.
We describe the design of Asirra, discuss threats to its security, and report early deployment experiences. We also describe 2 novel algorithms for amplifying the skill gap between humans and computers that can be used on many existing CAPTCHAs.
This paper reports on the benefits of large-scale statistical language modeling in machine translation. A distributed infrastructure is proposed which we use to train on up to 2 trillion tokens, resulting in language models having up to 300 billion n-grams. It is capable of providing smoothed probabilities for fast, single-pass decoding. We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large datasets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases.
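A minimal single-machine sketch of the Stupid Backoff scoring rule follows. It uses the relative frequency of the longest matching n-gram and, when that n-gram is unseen, backs off to a shorter context while multiplying by a fixed factor. The factor 0.4 is the value given in the paper's description rather than in this abstract. The helper names, the toy corpus, and the in-memory `Counter` are assumptions for illustration and have nothing to do with the distributed infrastructure the paper describes.

```python
from collections import Counter

ALPHA = 0.4  # fixed backoff factor


def count_ngrams(tokens, max_order):
    """Count all n-grams of order 1..max_order in a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(word, context, counts, total_tokens, alpha=ALPHA):
    """Return a score (not a normalized probability) for `word` after `context`."""
    context = tuple(context)
    penalty = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            # Seen n-gram: plain relative frequency, scaled by accumulated backoff.
            return penalty * counts[ngram] / counts[context]
        context = context[1:]   # drop the oldest context word
        penalty *= alpha        # pay the fixed backoff factor
    return penalty * counts[(word,)] / total_tokens  # unigram base case


tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, max_order=3)
seen = stupid_backoff("cat", ["the"], counts, len(tokens))
backed_off = stupid_backoff("mat", ["sat", "the"], counts, len(tokens))
unigram = stupid_backoff("sat", ["the"], counts, len(tokens))
print(seen, backed_off, unigram)
```

No discounting or normalization is computed, which is what makes the method cheap to train on very large corpora; the trade-off is that the scores do not sum to one.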
This chapter develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation-estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithm in non-trivial ways. For instance, a mediocre optimization algorithm, stochastic gradient descent, is shown to perform very well on large-scale learning problems.
…This chapter develops the ideas initially proposed by Bottou & Bousquet 2008 [“The tradeoffs of large scale learning”, NIPS 2007]. Section 13.2 proposes a decomposition of the test error where an additional term represents the impact of approximate optimization. In the case of small-scale learning problems, this decomposition reduces to the well-known tradeoff between approximation error and estimation error. In the case of large-scale learning problems, the tradeoff is more complex because it involves the computational complexity of the learning algorithm. Section 13.3 explores the asymptotic properties of the large-scale learning tradeoff for various prototypical learning algorithms under various assumptions regarding the statistical estimation rates associated with the chosen objective functions. This part clearly shows that the best optimization algorithms are not necessarily the best learning algorithms. Maybe more surprisingly, certain algorithms perform well regardless of the assumed rate of the statistical estimation error. Section 13.4 reports experimental results supporting this analysis.
…These results clearly show that the generalization performance of large-scale learning systems depends on both the statistical properties of the objective function and the computational properties of the chosen optimization algorithm. Their combination leads to surprising consequences:
The SGD and 2SGD results do not depend on the estimation rate α. When the estimation rate is poor, there is less need to optimize accurately. That leaves time to process more examples. A potentially more useful interpretation leverages the fact that (13.11) is already a kind of generalization bound: its fast rate trumps the slower rate assumed for the estimation error.
Second-order algorithms bring few asymptotical improvements in ε. Although the superlinear 2GD algorithm improves the logarithmic term, all 4 algorithms are dominated by the polynomial term in (1⁄ε). However, there are important variations in the influence of the constants d, κ, and ν. These constants are very important in practice.
Stochastic algorithms (SGD, 2SGD) yield the best generalization performance despite showing the worst optimization performance on the empirical cost. This phenomenon has already been described and observed in experiments (eg Bottou & Le Cun 2004).
In contrast, since the optimization error εopt of small-scale learning systems can be reduced to insignificant levels, their generalization performance is determined solely by the statistical properties of the objective function.
…Figure 13.1 shows how much time each algorithm takes to reach a given optimization accuracy. The superlinear algorithm TRON reaches the optimum with 10 digits of accuracy in less than one minute. The stochastic gradient starts more quickly but is unable to deliver such a high accuracy. The upper part of the figure clearly shows that the testing set loss stops decreasing long before the superlinear algorithm overcomes the SGD algorithm.
Figure 13.2 shows how the testing loss evolves with the training time. The stochastic gradient descent curve can be compared with the curves obtained using conjugate gradients on subsets of the training examples with increasing sizes. Assume, for instance, that our computing time budget is 1 second. Running the conjugate gradient algorithm on a random subset of 30,000 training examples achieves a much better performance than running it on the whole training set. How to guess the right subset size a priori remains unclear. Meanwhile, running the SGD algorithm on the full training set reaches the same testing set performance much faster.
…Conclusion: Taking into account budget constraints on both the number of examples and the computation time, we find qualitative differences between the generalization performance of small-scale learning systems and large-scale learning systems. The generalization properties of large-scale learning systems depend on both the statistical properties of the objective function and the computational properties of the optimization algorithm. We illustrate this fact with some asymptotic results on gradient algorithms.
This framework leaves room for considerable refinements. Shalev-Shwartz & Srebro 2008 rigorously extend the analysis to regularized risk formulations with linear parameterization and find again that, for learning purposes, SGD algorithms are often more attractive than standard primal or dual algorithms with good optimization complexity (Joachims 2006; Hush et al 2006). It could also be interesting to investigate how the choice of a surrogate loss function (Zhang 2004; Bartlett et al 2006) impacts the large-scale case.
Within the framework of the automatic processing of incoming mail documents, we present in this thesis the conception and development of a numerical field extraction system in weakly constrained handwritten documents.
Although the recognition of isolated handwritten entities can be considered as a partially solved problem, the extraction of information in images of complex and free-layout documents is still a challenge. This problem requires the implementation of both handwriting recognition and information extraction methods inspired by approaches developed within the field of information extraction in electronic documents.
Our contribution consists in the conception and the implementation of 2 different strategies: the first extends classical handwriting recognition methods, while the second is inspired by approaches used within the field of information extraction in electronic documents.
The results obtained on a real handwritten mail database show that our second approach is substantially better.
Finally, a complete, generic and efficient system is produced, answering one of the emergent perspectives in the field of the automatic reading of handwritten documents: the extraction of complex information in images of documents. [Text of paper is in French.]
By formulating Helmholtz’s ideas about perception, in terms of modern-day theories, one arrives at a model of perceptual inference and learning that can explain a remarkable range of neurobiological facts: using constructs from statistical physics, the problems of inferring the causes of sensory input and learning the causal structure of their generation can be resolved using exactly the same principles. Furthermore, inference and learning can proceed in a biologically plausible fashion. The ensuing scheme rests on Empirical Bayes and hierarchical models of how sensory input is caused. The use of hierarchical models enables the brain to construct prior expectations in a dynamic and context-sensitive fashion. This scheme provides a principled way to understand many aspects of cortical organisation and responses.
In this paper, we show these perceptual processes are just one aspect of emergent behaviours of systems that conform to a free energy principle. The free energy considered here measures the difference between the probability distribution of environmental quantities that act on the system and an arbitrary distribution encoded by its configuration. The system can minimise free energy by changing its configuration to affect the way it samples the environment or change the distribution it encodes. These changes correspond to action and perception respectively and lead to an adaptive exchange with the environment that is characteristic of biological systems. This treatment assumes that the system’s state and structure encode an implicit and probabilistic model of the environment. We will look at the models entailed by the brain and how minimisation of its free energy can explain its dynamics and structure.
We present a large-scale experimental comparison of logistic regression and tree induction (C4.5), assessing classification accuracy and the quality of rankings based on class-membership probabilities.
We use a learning-curve analysis to examine the relationship of these measures to the size of the training set.
The results of the study show several things:
Contrary to some prior observations, logistic regression does not generally outperform tree induction.
More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (that is, the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves.
Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications.
Finally, the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of the separability of signal from noise. [Keywords: decision trees, learning curves, logistic regression, ROC analysis, tree induction]
…The average data-set size is larger than is usual in machine-learning research, and we see behavioral characteristics that would be overlooked when comparing algorithms only on smaller data sets (such as most in the UCI repository; see Blake & Merz 2000).
…Papers such as this seldom consider carefully the size of the data sets to which the algorithms are being applied. Does the relative performance of the different learning methods depend on the size of the data set?
More than a decade ago in machine learning research, the examination of learning curves was commonplace (see, for example, Kibler & Langley 1988), but usually on single data sets (notable exceptions being the study by Shavlik et al 1991, and the work of Catlett 1991 [“Megainduction: machine learning on very large databases”]). Now learning curves are presented only rarely in comparisons of learning algorithms. Learning curves also are found in the statistical literature (Flury & Schmid 1994) and in the neural network literature (Cortes et al 1994). They have been analyzed theoretically, using statistical mechanics (Watkin et al 1993; Haussler et al 1996).
The few cases that exist draw conflicting conclusions, with respect to our goals. Domingos & Pazzani 1997 compare classification-accuracy learning curves of naive Bayes and the C4.5RULES rule learner (Quinlan 1993). On synthetic data, they show that naive Bayes performs better for smaller training sets and C4.5RULES performs better for larger training sets (the learning curves cross). They discuss that this can be explained by considering the different bias/variance profile of the algorithms for classification (zero/one loss). Roughly speaking, variance plays a more critical role than estimation bias when considering classification accuracy. For smaller data sets, naive Bayes has a substantial advantage over tree or rule induction in terms of variance. They show that this is the case even when (by their construction) the rule learning algorithm has no bias. As expected, as larger training sets reduce variance, C4.5RULES approaches perfect classification. Brain & Webb 1999 perform a similar bias/variance analysis of C4.5 and naive Bayes. They do not examine whether the curves cross, but do show on 4 UCI data sets that variance is reduced consistently with more data, but bias is not. These results do not directly examine logistic regression, but the bias/variance arguments do apply: logistic regression, a linear model, should have higher bias but lower variance than tree induction. Therefore, one would expect that their learning curves might cross.
However, the results of Domingos & Pazzani 1997 were generated from synthetic data where the rule learner had no bias. Would we see such behavior on real-world domains? Kohavi 1996 shows classification-accuracy learning curves of tree induction (using C4.5) and of naive Bayes for 9 UCI data sets. With only one exception, either naive Bayes or tree induction dominates (that is, the performance of one or the other is superior consistently for all training-set sizes). Furthermore, by examining the curves, Kohavi concludes that “In most cases, it is clear that even with much more data, the learning curves will not cross” (pp. 203–204).
We are aware of only one learning-curve analysis that compares logistic regression and tree induction. Harris-Jones & Haines 1997 [“Sample size and misclassification: is more always better?”] compare them on 2 business data sets, one real and one synthetic. For these data the learning curves cross, suggesting (as they observe) that logistic regression is preferable for smaller data sets and tree induction for larger data sets. Our results generally support this conclusion.
…These results concur with recent results (Ng & Jordan 2001) comparing discriminative and generative versions of the same model (viz., logistic regression and naive Bayes), which show that learning curves often cross…A corollary observation is that even for very large data-set sizes, the slope of the learning curves remains distinguishable from zero. Catlett 1991 concluded that learning curves continue to grow, on several large-at-the-time data sets (the largest with fewer than 100,000 training examples). Provost & Kolluri 1999 suggest that this conclusion should be revisited as the size of data sets that can be processed (feasibly) by learning algorithms increases. Our results provide a contemporary reiteration of Catlett’s. On the other hand, our results seemingly contradict conclusions or assumptions made in some prior work. For example, Oates & Jensen 1997 conclude that classification-tree learning curves level off, and Provost et al 1999 replicate this finding and use it as an assumption of their sampling strategy. Technically, the criterion for a curve to have reached a plateau in these studies is that there be less than a certain threshold (<1%) increase in accuracy from the accuracy with the largest data-set size; however, the conclusion often is taken to mean that increases in accuracy cease. Our results show clearly that this latter interpretation is not appropriate even for our largest data-set sizes.
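The gap between the plateau criterion and the "accuracy has stopped increasing" reading can be illustrated with a minimal sketch. The function name `plateau_size`, the toy curve, and the choice to treat the 1% threshold as an absolute difference in accuracy are all assumptions made for illustration; the cited studies may define the threshold differently.

```python
def plateau_size(curve, threshold=0.01):
    """Return the smallest training-set size whose accuracy is within
    `threshold` of the accuracy at the largest size.

    `curve` is a list of (size, accuracy) pairs sorted by size.
    The threshold is treated as an absolute accuracy difference (an assumption).
    """
    final_accuracy = curve[-1][1]
    for size, accuracy in curve:
        if final_accuracy - accuracy < threshold:
            return size
    return curve[-1][0]


# Hypothetical learning curve that is still rising at every size.
toy_curve = [(1_000, 0.800), (10_000, 0.850), (100_000, 0.858), (1_000_000, 0.862)]

plateau = plateau_size(toy_curve)
still_rising = toy_curve[-1][1] > toy_curve[-2][1]
print(plateau, still_rising)
```

On this toy curve the criterion declares a plateau at 100,000 examples even though accuracy is still higher at 1,000,000, which is the distinction the passage draws.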
Neural networks are a powerful technology for classification of visual inputs arising from documents. However, there is a confusing plethora of different neural network methods that are used in the literature and in industry.
This paper describes a set of concrete best practices that document analysis researchers can use to get good results with neural networks.
The most important practice is getting a training set as large as possible: we expand the training set by adding a new form of distorted data.
The next most important practice is that convolutional neural networks are better suited for visual document tasks than fully connected networks. We propose that a simple “do-it-yourself” implementation of convolution with a flexible architecture is suitable for many visual document problems. This simple convolutional neural network does not require complex methods, such as momentum, weight decay, structure-dependent learning rates, averaging layers, tangent prop, or even finely-tuning the architecture.
The end result is a very simple yet general architecture which can yield state-of-the-art performance for document analysis.
We illustrate our claims on the MNIST set of English digit images.
This paper is an invited contribution to the 50th anniversary issue of the journal Operations Research, published by the Institute of Operations Research and Management Science (INFORMS). It describes one person’s perspective on the development of computational tools for linear programming. The paper begins with a short personal history, followed by historical remarks covering the some 40 years of linear-programming developments that predate my own involvement in this subject. It concludes with a more detailed look at the evolution of computational linear programming since 1987.
…In this paper I have focused primarily on one issue, solving larger, more difficult linear programs faster. The numbers presented speak for themselves. 3 orders of magnitude in machine speed and 3 orders of magnitude in algorithmic speed add up to six orders of magnitude in solving power: A model that might have taken a year to solve 10 years ago can now solve in less than 30 seconds. Of course, no one waits 1 year to solve a model, at least no one I know. The real meaning of such an advance is much harder to measure in practice, but it is real nevertheless. There is no doubt that we now have optimization engines at our disposal that dwarf what was available only a few years ago, making possible the solution of real-world models once considered intractable, and opening up whole new domains of application.
How do these speed improvements fit into the overall picture of linear-programming practice? They are only a part of that picture, though an essential, enabling part. The pervasive availability of powerful, usable desktop computing, the availability of data to feed our models, and the emergence of algebraic modeling languages to represent our models have all combined with the underlying engines to make operations research and linear programming the powerful tools they are today. However, there are still important issues to be solved. In spite of all the advances, the application of linear programming remains primarily the domain of experts. The need for abstraction still stands as a hurdle between technology and solutions. While the existence of this hurdle is disconcerting, it is at least gratifying to know that the benefits from overcoming it are now greater than ever.
Previous work on estimating the entropy of written natural language has focused primarily on English. We expand this work by considering other natural languages, including Arabic, Chinese, French, Greek, Japanese, Korean, Russian, and Spanish. We present the results of PPM compression on machine-generated and human-generated translations of texts into various languages. Under the assumption that languages are equally expressive, and that PPM compression does well across languages, one would expect that translated documents would compress to approximately the same size. We verify this empirically on a novel corpus of translated documents. We suggest as an application of this finding using the size of compressed natural language texts as a means of automatically testing translation quality.
The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
…We collected a 1-billion-word training corpus from a variety of English texts, including news articles, scientific abstracts, government transcripts, literature and other varied forms of prose. This training corpus is three orders of magnitude greater than the largest training corpus previously used for this problem. We used 1 million words of Wall Street Journal text as our test set, and no data from the Wall Street Journal was used when constructing the training corpus. Each learner was trained at several cutoff points in the training corpus, i.e., the first one million words, the first five million words, and so on, until all one billion words were used for training. In order to avoid training biases that may result from merely concatenating the different data sources to form a larger training corpus, we constructed each consecutive training corpus by probabilistically sampling sentences from the different sources weighted by the size of each source.
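The size-weighted sampling described above might look roughly like the following sketch. The function `sample_corpus`, the toy source names, and the decision to sample with replacement are assumptions for illustration; the excerpt does not specify those details.

```python
import random


def sample_corpus(sources, n_sentences, seed=0):
    """Draw sentences by first picking a source with probability
    proportional to its size, then picking a sentence from that source.

    Sampling is with replacement here, which is an assumption.
    """
    rng = random.Random(seed)
    names = list(sources)
    weights = [len(sources[name]) for name in names]
    corpus = []
    for _ in range(n_sentences):
        name = rng.choices(names, weights=weights, k=1)[0]
        corpus.append((name, rng.choice(sources[name])))
    return corpus


# Hypothetical sources of very different sizes.
sources = {
    "news": [f"news sentence {i}" for i in range(900)],
    "abstracts": [f"abstract sentence {i}" for i in range(100)],
}

sample = sample_corpus(sources, 1000)
news_count = sum(1 for name, _ in sample if name == "news")
print(news_count, len(sample) - news_count)
```

Because every prefix of such a sample mixes the sources in roughly the same proportions, a cutoff at the first N sentences is not dominated by whichever source happened to be concatenated first.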
In Figure 1, we show learning curves for each learner, up to one billion words of training data. Each point in the graph is the average performance over ten confusion sets for that size training corpus. Note that the curves appear to be log-linear even out to one billion words.
We compare discriminative and generative learning as typified by logistic regression and naive Bayes.
We show, contrary to a widely-held belief that discriminative classifiers are almost always to be preferred, that there can often be 2 distinct regimes of performance as the training set size is increased, one in which each algorithm does better.
This stems from the observation—which is borne out in repeated experiments—that while discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster.
Having access to massive amounts of data does not necessarily imply that induction algorithms must use them all. Samples often provide the same accuracy with far less computational cost. However, the correct sample size rarely is obvious.
We analyze methods for progressive sampling—using progressively larger samples as long as model accuracy improves. We explore several notions of efficient progressive sampling.
We analyze efficiency relative to induction with all instances; we show that a simple, geometric sampling schedule is asymptotically optimal, and we describe how best to take into account prior expectations of accuracy convergence.
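A geometric schedule grows the sample size by a constant factor at each step. The sketch below is illustrative only: `geometric_schedule`, `progressive_sample`, the `accuracy_at` callback, and the simple "gain below `min_gain`" stopping rule are assumed names and simplifications, not the convergence-detection method of the paper.

```python
def geometric_schedule(n0, factor, n_total):
    """Sample sizes n0, n0*factor, n0*factor**2, ..., capped at n_total."""
    sizes = []
    n = n0
    while n < n_total:
        sizes.append(n)
        n *= factor
    sizes.append(n_total)
    return sizes


def progressive_sample(schedule, accuracy_at, min_gain=0.001):
    """Train on progressively larger samples and stop at the first size
    whose accuracy gain over the previous size falls below min_gain.

    `accuracy_at(n)` is a hypothetical callback that trains on n
    instances and returns held-out accuracy.
    """
    previous = None
    for n in schedule:
        accuracy = accuracy_at(n)
        if previous is not None and accuracy - previous < min_gain:
            return n, accuracy
        previous = accuracy
    return schedule[-1], previous


# Toy learning curve standing in for real training runs.
def toy_curve(n):
    return 0.9 - 5.0 / n


schedule = geometric_schedule(100, 2, 100_000)
stop_size, stop_accuracy = progressive_sample(schedule, toy_curve)
print(schedule[:5], stop_size)
```

With a doubling schedule, the total number of instances processed up to any stopping point is less than twice the final sample size, which gives some intuition for why geometric growth keeps the overhead of the earlier, smaller runs bounded.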
We then describe the issues involved in instantiating an efficient progressive sampler, including how to detect convergence. Finally, we provide empirical results comparing a variety of progressive sampling methods. We conclude that progressive sampling can be remarkably efficient.
One of the defining challenges for the KDD research community is to enable inductive learning algorithms to mine very large databases. This paper summarizes, categorizes, and compares existing work on scaling up inductive algorithms.
We concentrate on algorithms that build decision trees and rule sets, in order to provide focus and specific details; the issues and techniques generalize to other types of data mining.
We begin with a discussion of important issues related to scaling up. We highlight similarities among scaling techniques by categorizing them into 3 main approaches. For each approach, we then describe, compare, and contrast the different constituent techniques, drawing on specific examples from published papers.
Finally, we use the preceding analysis to suggest how to proceed when dealing with a large problem, and where to focus future research.
With the advent of data mining, machine learning has come of age and is now a critical technology in many businesses. However, machine learning evolved in a different research context to that in which it now finds itself employed. A particularly important problem in the data mining world is working effectively with large data sets. However, most machine learning research has been conducted in the context of learning from very small data sets.
To date most approaches to scaling up machine learning to large data sets have attempted to modify existing algorithms to deal with large data sets in a more computationally efficient and effective manner. But is this necessarily the best method?
This paper explores the possibility of designing algorithms specifically for large data sets. Specifically, the paper looks at how increasing data set size affects bias and variance error decompositions for classification algorithms.
Preliminary results of experiments to determine these effects are presented, showing that, as hypothesized, variance can be expected to decrease as training set size increases. No clear effect of training set size on bias was observed.
These results have profound implications for data mining from large data sets, indicating that developing effective learning algorithms for large data sets is not simply a matter of finding computationally efficient variants of existing learning algorithms.
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter’s (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM).
Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is 𝒪(1).
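The role of the constant error carousel and the gates can be sketched with a single scalar memory cell. This is a simplified illustration under stated assumptions: the function `memory_cell_step` and its arguments are hypothetical names, the gate pre-activations are passed in directly rather than computed from learned weights, and no training is shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def memory_cell_step(state, cell_input, in_gate_net, out_gate_net):
    """One forward step of a single scalar memory cell.

    The internal state has a self-connection with fixed weight 1.0
    (the constant error carousel), so it changes only by what the
    input gate lets in. The output gate controls what is read out.
    """
    in_gate = sigmoid(in_gate_net)
    out_gate = sigmoid(out_gate_net)
    new_state = state + in_gate * math.tanh(cell_input)
    output = out_gate * math.tanh(new_state)
    return new_state, output


# Write a value once with the input gate open.
state, _ = memory_cell_step(0.0, 1.0, in_gate_net=20.0, out_gate_net=-20.0)
stored = state

# Present 1000 distractor inputs with both gates closed.
for t in range(1000):
    state, _ = memory_cell_step(state, math.sin(t), in_gate_net=-20.0, out_gate_net=-20.0)

# Read the value back by opening the output gate.
state, output = memory_cell_step(state, 0.0, in_gate_net=-20.0, out_gate_net=20.0)
print(stored, state, output)
```

Because the self-connection weight is exactly 1.0, the stored value neither decays nor grows across the 1000 intervening steps; a self-connection weight below 1.0 would instead shrink the state geometrically.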
Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
The simple Bayesian classifier is known to be optimal when attributes are independent given the class, but the question of whether other sufficient conditions for its optimality exist has so far not been explored.
Empirical results showing that it performs surprisingly well in many domains containing clear attribute dependences suggest that the answer to this question may be positive.
This article shows that, although the Bayesian classifier’s probability estimates are only optimal under quadratic loss if the independence assumption holds, the classifier itself can be optimal under zero-one loss (misclassification rate) even when this assumption is violated by a wide margin. The region of quadratic-loss optimality of the Bayesian classifier is in fact a second-order infinitesimal fraction of the region of zero-one optimality.
This implies that the Bayesian classifier has a much greater range of applicability than previously thought. For example, in this article it is shown to be optimal for learning conjunctions and disjunctions, even though they violate the independence assumption.
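The conjunction case can be checked directly on a small example. The sketch below assumes a uniform distribution over all boolean inputs and unsmoothed maximum-likelihood estimates; the function name `naive_bayes_on_conjunction` is hypothetical and the setup is an illustration, not the article's proof.

```python
from fractions import Fraction
from itertools import product


def naive_bayes_on_conjunction(n):
    """Fit naive Bayes on all 2**n boolean inputs labeled by their AND
    and report whether every input is classified correctly.

    Assumes each input is equally likely and uses exact fractions,
    so there is no floating-point rounding.
    """
    inputs = list(product([0, 1], repeat=n))
    labels = [int(all(x)) for x in inputs]
    classes = (0, 1)

    prior = {c: Fraction(labels.count(c), len(labels)) for c in classes}

    # cond[c][i] is the estimate of P(x_i = 1 | class c).
    cond = {}
    for c in classes:
        members = [x for x, y in zip(inputs, labels) if y == c]
        cond[c] = [Fraction(sum(x[i] for x in members), len(members)) for i in range(n)]

    def score(x, c):
        p = prior[c]
        for i, xi in enumerate(x):
            p *= cond[c][i] if xi else 1 - cond[c][i]
        return p

    predictions = [max(classes, key=lambda c: score(x, c)) for x in inputs]
    return predictions == labels


results = {n: naive_bayes_on_conjunction(n) for n in range(1, 7)}
print(results)
```

Within the negative class the attributes are clearly dependent (they cannot all be 1 at once), yet the class with the larger naive Bayes score still matches the true label on every input, which is the zero-one-loss sense of optimality the abstract refers to.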
Further, studies in artificial domains show that it will often outperform more powerful classifiers for common training set sizes and numbers of attributes, even if its bias is a priori much less appropriate to the domain.
This article’s results also imply that detecting attribute dependence is not necessarily the best way to extend the Bayesian classifier, and this is also verified empirically.
We present a new algorithm for finding low-complexity neural networks with high generalization capability. The algorithm searches for a “flat” minimum of the error function. A flat minimum is a large connected region in weight space where the error remains approximately constant. An MDL-based, Bayesian argument suggests that flat minima correspond to “simple” networks and low expected overfitting. The argument is based on a Gibbs algorithm variant and a novel way of splitting generalization error into underfitting and overfitting error. Unlike many previous approaches, ours does not require gaussian assumptions and does not depend on a “good” weight prior. Instead we have a prior over input output functions, thus taking into account net architecture and training set. Although our algorithm requires the computation of second-order derivatives, it has backpropagation’s order of complexity. Automatically, it effectively prunes units, weights, and input lines. Various experiments with feedforward and recurrent nets are described. In an application to stock market prediction, flat minimum search outperforms conventional backprop, weight decay, and “optimal brain surgeon/optimal brain damage.”
This paper presents experiments with 19 datasets and 5 decision tree pruning algorithms that show that increasing training set size often results in a linear increase in tree size, even when that additional complexity results in no substantial increase in classification accuracy. Said differently, removing randomly selected training instances often results in trees that are substantially smaller and just as accurate as those built on all available training instances.
This implies that decreases in tree size obtained by more sophisticated data reduction techniques should be decomposed into 2 parts: that which is due to reduction of training set size, and the remainder, which is due to how the method selects instances to discard.
We perform this decomposition for one recent data reduction technique, John’s ROBUSTC4.5 (John 1995), and show that a large percentage of its effect on tree size is attributable to the fact that it simply reduces the size of the training set.
We conclude that random data reduction is a baseline against which more sophisticated data reduction techniques should be compared. Finally, we examine one possible cause of the pathological relationship between tree size and training set size.
In this paper we introduce and investigate a mathematically rigorous theory of learning curves that is based on ideas from statistical mechanics.
The advantage of our theory over the well-established Vapnik-Chervonenkis theory is that our bounds can be considerably tighter in many cases, and are also more reflective of the true behavior of learning curves.
This behavior can often exhibit dramatic properties such as phase transitions, as well as power law asymptotics not explained by the VC theory.
The disadvantages of our theory are that its application requires knowledge of the input distribution, and it is limited so far to finite cardinality function classes.
We illustrate our results with many concrete examples of learning curve bounds derived from our theory.
Naive-Bayes induction algorithms were previously shown to be surprisingly accurate on many classification tasks even when the conditional independence assumption on which they are based is violated. However, most studies were done on small databases.
We show that in some larger databases, the accuracy of Naive-Bayes does not scale up as well as decision trees.
We then propose a new algorithm, NBTree, which induces a hybrid of decision-tree classifiers and Naive-Bayes classifiers: the decision-tree nodes contain univariate splits as regular decision-trees, but the leaves contain Naive-Bayesian classifiers. The approach retains the interpretability of Naive-Bayes and decision trees, while resulting in classifiers that frequently outperform both constituents, especially in the larger databases tested.
We estimate a neural network’s ability to generalize from examples using ideas from statistical mechanics. We discuss the connection between this approach and other powerful concepts from mathematical statistics, computer science, and information theory that are useful in explaining the performance of such machines. For the simplest network, the perceptron, we introduce a variety of learning problems that can be treated exactly by the replica method of statistical physics.
The theoretical work by Wiener and others on the spectral analysis of stationary time series penetrated statistics following Tukey’s heuristic work on estimation of the spectrum. In refereeing papers for NIPS the author was struck by the growing emphasis on mathematical theory. Mathematical theory is not critical to the development of machine learning. In machine learning, the current panacea is a sigmoid network fitted using backpropagation. The pi-method, for approximating functions using noisy data, was suggested by results in mathematical approximation theory. In spite of intense activity, none of the work has had any effect on the day-to-day practice of statistics, or even on present-day theory. The list of useful theories was not meant to be inclusive, but even a more inclusive list would be very short. A possible reason is that it is difficult to formulate reasonable analytic models for complex data.
…Uses Of Theory
Comfort: We knew it worked, but it’s nice to have a proof.
Insight: Aha! So that’s why it works.
Innovation: At last, a mathematically proven idea that applies to data.
Suggestion: Something like this might work with data.
…Our fields would be better off with far fewer theorems, less emphasis on faddish stuff, and much more scientific inquiry and engineering. But the latter requires real thinking. For instance, there are many important questions regarding neural networks which are largely unanswered. There seem to be conflicting stories regarding the following issues:
Why don’t heavily parameterized neural networks overfit the data?
What is the effective number of parameters?
Why doesn’t backpropagation head for a poor local minima?
When should one stop the backpropagation and use the current parameters?
It makes research more interesting to know that there is no one universally best method. What is best is data dependent. Sometimes “least glamorous” methods such as nearest neighbor are best. We need to learn more about what works best where. But emphasis on theory often distracts us from doing good engineering and living with the data.
Time is at the heart of many pattern recognition tasks (e.g., speech recognition). However, connectionist learning algorithms to date are not well-suited for dealing with time-varying input patterns.
This chapter introduces a specialized connectionist architecture and corresponding specialization of the backpropagation learning algorithm that operates efficiently, both in computational time and space requirements, on temporal sequences. The key feature of the architecture is a layer of self-connected hidden units that integrate their current value with the new input at each time step to construct a static representation of the temporal input sequence. This architecture avoids two deficiencies found in the backpropagation unfolding-in-time procedure (Rumelhart, Hinton, & Williams, 1986) for handling sequence recognition tasks: first, it reduces the difficulty of temporal credit assignment by focusing the backpropagated error signal; second, it eliminates the need for a buffer to hold the input sequence and/or intermediate activity levels. The latter property is due to the fact that during the forward (activation) phase, incremental activity traces can be locally computed that hold all information necessary for backpropagation in time.
It is argued that this architecture should scale better than conventional recurrent architectures with respect to sequence length. The architecture has been used to implement a temporal version of Rumelhart and McClelland’s (1986) verb past-tense model. The hidden units learn to behave something like Rumelhart and McClelland’s “Wickelphones”, a rich and flexible representation of temporal information.
A summary is presented of the statistical mechanical theory of learning a rule with a neural network, a rapidly advancing area which is closely related to other inverse problems frequently encountered by physicists. By emphasizing the relationship between neural networks and strongly interacting physical systems, such as spin glasses, the authors show how learning theory has provided a workshop in which to develop new, exact analytical techniques.
The present paper elucidates a universal property of learning curves, which shows how the generalization error, training error, and the complexity of the underlying stochastic machine are related and how the behavior of a stochastic machine is improved as the number of training examples increases. The error is measured by the entropic loss. It is proved that the generalization error converges to H0, the entropy of the conditional distribution of the true machine, as H0 + m*/(2t), while the training error converges as H0 − m*/(2t), where t is the number of examples and m* shows the complexity of the network. When the model is faithful, implying that the true machine is in the model, m* is reduced to m, the number of modifiable parameters. This is a universal law because it holds for any regular machine irrespective of its structure under the maximum likelihood estimator. Similar relations are obtained for the Bayes and Gibbs learning algorithms. These learning curves show the relation among the accuracy of learning, the complexity of a model, and the number of training examples.
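The two asymptotic expressions can be evaluated numerically to see how the generalization and training errors approach H0 from opposite sides. This is a small sketch of the leading-order terms only; the function names and the example values of H0, m*, and t are assumptions chosen for illustration.

```python
def entropic_learning_curves(h0, m_star, t):
    """Leading-order generalization and training error under entropic loss.

    h0     : entropy of the conditional distribution of the true machine
    m_star : complexity term (equals the parameter count m for a faithful model)
    t      : number of training examples
    """
    gap = m_star / (2.0 * t)
    return h0 + gap, h0 - gap


def examples_for_excess_error(m_star, epsilon):
    """Number of examples t at which the excess generalization error
    m_star / (2 t) equals epsilon, by rearranging the same expression."""
    return m_star / (2.0 * epsilon)


generalization, training = entropic_learning_curves(h0=0.5, m_star=20, t=100)
needed = examples_for_excess_error(m_star=20, epsilon=0.01)
print(generalization, training, needed)
```

The difference between the two curves is m*/t, so under these expressions doubling the number of examples halves the gap between training and generalization error.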
[Reprint of 1986 chapter] This paper presents an alternative to the standard rule-based account of a child’s acquisition of the past tense in English. Children are typically said to pass through a 3-phase acquisition process in which they first learn past tense by rote, then learn the past tense rule and overregularize, and then finally learn the exceptions to the rule.
We show that the acquisition data can be accounted for in more detail by dispensing with the assumption that the child learns rules and substituting in its place a simple homogeneous learning procedure. We show how rule-like behavior can emerge from the interactions among a network of units encoding the root form to past tense mapping.
A large computer simulation of the learning process demonstrates the operating principles of our alternative account, shows how details of the acquisition process not captured by the rule account emerge, and makes predictions about other details of the acquisition process not yet observed.
Training classifiers on large databases is computationally demanding. It is desirable to develop efficient procedures for a reliable prediction of a classifier’s suitability for implementing a given task, so that resources can be assigned to the most promising candidates or freed for exploring new classifier candidates.
We propose such a practical and principled predictive method. Practical because it avoids the costly procedure of training poor classifiers on the whole training set, and principled because of its theoretical foundation.
The effectiveness of the proposed procedure is demonstrated for both single- and multi-layer networks.
Previous algorithms for supervised sequence learning are based on dynamic recurrent networks. This paper describes an alternative class of gradient-based systems consisting of two feedforward nets that learn to deal with temporal sequences using fast weights: The first net learns to produce context-dependent weight changes for the second net whose weights may vary very quickly. The method offers the potential for STM storage efficiency: A single weight (instead of a full-fledged unit) may be sufficient for storing temporal information. Various learning methods are derived. Two experiments with unknown time delays illustrate the approach. One experiment shows how the system can be used for adaptive temporary variable binding.
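The division of labor between the two nets can be sketched as follows. This is a hand-wired toy, not the paper's learning method: the slow net is assumed to be a fixed linear layer, the fast net a single tanh unit, and all names (`slow_net`, `fast_net`, `run`) are hypothetical. It only shows how one fast weight can hold information from an earlier time step.

```python
import math

def slow_net(x, slow_w):
    """Slow net (assumed linear): maps the input to one weight change per fast weight."""
    return [sum(w_ij * xj for w_ij, xj in zip(row, x)) for row in slow_w]

def fast_net(x, fast_w):
    """Fast net: a single tanh unit whose weights are the fast weights."""
    return math.tanh(sum(w * xi for w, xi in zip(fast_w, x)))

def run(sequence, slow_w, n_in):
    """Feed a sequence through both nets, updating the fast weights at every step."""
    fast_w = [0.0] * n_in
    outputs = []
    for x in sequence:
        outputs.append(fast_net(x, fast_w))
        delta = slow_net(x, slow_w)
        fast_w = [w + dw for w, dw in zip(fast_w, delta)]
    return outputs, fast_w

# Input component 0 acts as a "store" signal that writes into fast weight 1.
slow_w = [[0.0, 0.0], [2.0, 0.0]]
with_store, _ = run([[1.0, 0.0], [0.0, 1.0]], slow_w, 2)
without_store, _ = run([[0.0, 0.0], [0.0, 1.0]], slow_w, 2)
```

In this toy the second output differs between the two runs only because of the value written into the fast weight at the first step. In the paper the slow net's weights are learned by gradient descent rather than set by hand.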
Since the seminal article by Rumelhart, Hinton, and Williams [RHW86], backpropagation (BP) has become very popular as a learning method for neural networks with and without feedback. In contrast to many other learning methods for neural networks, BP takes into account the network structure and improves the network on the basis of this knowledge.
When a very remote past input has to influence the present output, this input, if it is randomly selected, is very unlikely to influence the present state of the network. Hence BP algorithms do not detect the fact that this input is responsible for the desired output. Therefore, it is very hard for BP algorithms to train a network to remember an input until it is needed to produce a later output. Moreover, the public BP algorithms take a very long time to compute.
In many cases, though, one needs an input sequence, as in Mozer [Moz90], where a network learns to compose music: musical pieces are repeated, and later note pitches are determined by previous note pitches. Another example is a network that steers a vehicle in a labyrinth and obtains the error information only if the vehicle is in a dead end, so back-propagated errors are needed. If a neural network controls a robot that performs a task, perhaps some preparatory tasks are necessary whose performance the system should remember.
In this work, we investigate how to approach the problem of the long learning time associated with network inputs that are used later to control a desired output. This can be done either by means of the network architecture or by using the structure of the input sequences. In Chapter 4, a network is built so that inputs that are received at long delays are considered better than in the usual network architecture. Here, ‘storage nodes’ are introduced, which can carry information about an arbitrarily long time interval. The shortening of the input sequences, while retaining all relevant information, is investigated in Chapter 3. With a shortened input sequence, one need not look as far back into the past to recognize the relevant inputs. Chapter 1 presents the BP learning algorithms used, which Chapter 2 then analyzes to determine the cause of the long learning time needed to learn to store past inputs. In some cases the algorithms were slightly modified to save computational time. The problem of resource acquisition that arises in the methods of Chapters 3 and 4 is addressed in Chapter 5.
The described experiments were performed on Sparc-based SUN stations. Due to time-resource constraints, algorithm comparison tests could not be carried out to the desired extent. There were trials that ran for up to a week on these machines, but other processes with higher priority were also running on these machines.
The definitions and notations in this work are not those commonly used in studies of neural networks, but they are introduced here only for this work. The reason is that there are no uniform, fundamental definitions for neural networks on which other authors would have based their work. Therefore, it is not guaranteed that there are no inconsistencies in the definitions and notations with other works.
These results were not all mathematically proven, as the work does not claim to be a mathematical analysis of neural networks. It is also difficult to find simple mathematical formalisms for neural networks. The work will rather describe ideas and approaches to see if it is possible to get a better grip on the problem of the long learning time for important previous inputs.
Besides the methods described here for learning in non-static environments, there is also the approach of the “Adaptive Critic”, as described in [Sch90a] and [Sch90c [“Recurrent networks adjusted by adaptive critics”]]. The approach of “fast weights” by Schmidhuber [Sch91b] also provides a storage function, although with a completely different approach than in Chapter 4, where a storage is also constructed.
Despite the fact that many symbolic and neural network (connectionist) learning algorithms address the same problem of learning from classified examples, very little is known regarding their comparative strengths and weaknesses.
Experiments comparing the ID3 symbolic learning algorithm with the perceptron and backpropagation neural learning algorithms have been performed using 5 large, real-world data sets.
Overall, backpropagation performs slightly better than the other 2 algorithms in terms of classification accuracy on new examples, but takes much longer to train. Experimental results suggest that backpropagation can work statistically-significantly better on data sets containing numerical data.
Also analyzed empirically are the effects of (1) the amount of training data, (2) imperfect training examples, and (3) the encoding of the desired outputs.
Backpropagation occasionally outperforms the other 2 systems when given relatively small amounts of training data. It is slightly more accurate than ID3 when examples are noisy or incompletely specified. Finally, backpropagation more effectively utilizes a “distributed” output encoding.
1990-schwartz.pdf: “Exhaustive Learning”, D. B. Schwartz, V. K. Samalam, Sara A. Solla, J. S. Denker (1990-09-01):
Exhaustive exploration of an ensemble of networks is used to model learning and generalization in layered neural networks. A simple Boolean learning problem involving networks with binary weights is numerically solved to obtain the entropy Sm and the average generalization ability Gm as a function of the size m of the training set. Learning curves Gm vs m are shown to depend solely on the distribution of generalization abilities over the ensemble of networks. Such distribution is determined prior to learning, and provides a novel theoretical tool for the prediction of network performance on a specific task.
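The exhaustive approach can be sketched on a toy problem. The code below enumerates every ±1-weight perceptron on a small odd number of inputs, scores each against a teacher network to get the prior distribution of generalization abilities, and then computes an average learning curve from that distribution alone. The moment-ratio formula in `average_generalization` assumes training examples are drawn independently at random and every network consistent with them is equally likely; that assumption, the teacher choice, and all function names are illustrative rather than taken from the paper.

```python
from itertools import product

def sign_output(w, x):
    """Output of a +-1 perceptron; the sum is never zero when len(w) is odd."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def generalization_abilities(teacher):
    """Fraction of all inputs on which each binary-weight network agrees with the teacher."""
    n = len(teacher)
    inputs = list(product((-1, 1), repeat=n))
    targets = [sign_output(teacher, x) for x in inputs]
    abilities = []
    for w in product((-1, 1), repeat=n):
        agree = sum(sign_output(w, x) == t for x, t in zip(inputs, targets))
        abilities.append(agree / len(inputs))
    return abilities

def average_generalization(abilities, m):
    """Average ability after m random examples, weighting each network by g**m.

    A network with ability g is consistent with m random examples
    with probability g**m (assumed sampling with replacement).
    """
    num = sum(g ** (m + 1) for g in abilities)
    den = sum(g ** m for g in abilities)
    return num / den

abilities = generalization_abilities((1, 1, 1, 1, 1))
curve = [average_generalization(abilities, m) for m in range(21)]
```

The learning curve here is computed without running any training procedure, which is the sense in which the distribution is "determined prior to learning".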
Time underlies many interesting human behaviors. Thus, the question of how to represent time in connectionist models is very important. One approach is to represent time implicitly by its effects on processing rather than explicitly (as in a spatial representation).
The current report develops a proposal along these lines, first described by Jordan (1986), which involves the use of recurrent links in order to provide networks with a dynamic memory. In this approach, hidden unit patterns are fed back to themselves; the internal representations which develop thus reflect task demands in the context of prior internal states.
A set of simulations is reported which range from relatively simple problems (temporal version of XOR) to discovering syntactic/semantic features for words. The networks are able to learn interesting internal representations which incorporate task demands with memory demands; indeed, in this approach the notion of memory is inextricably bound up with task processing. These representations reveal a rich structure, which allows them to be highly context-dependent, while also expressing generalizations across classes of items.
These representations suggest a method for representing lexical categories and the type/token distinction.
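The feedback of hidden unit patterns described above can be sketched as a forward pass only. The tanh nonlinearity, the weight layout, and the names `srn_step` and `run_srn` are assumptions for illustration; training is omitted.

```python
import math

def srn_step(x, context, w_in, w_ctx):
    """One time step: each hidden unit sees the current input and the previous hidden pattern."""
    hidden = []
    for row_in, row_ctx in zip(w_in, w_ctx):
        net = sum(a * b for a, b in zip(row_in, x))
        net += sum(a * b for a, b in zip(row_ctx, context))
        hidden.append(math.tanh(net))
    return hidden

def run_srn(sequence, w_in, w_ctx):
    """Feed a sequence through the network, copying the hidden pattern back as context."""
    context = [0.0] * len(w_in)
    for x in sequence:
        context = srn_step(x, context, w_in, w_ctx)
    return context

# Two hidden units, one input line; the weights are arbitrary example values.
w_in = [[1.0], [-1.0]]
w_ctx = [[0.5, 0.0], [0.0, 0.5]]
after_one_zero = run_srn([[1.0], [0.0]], w_in, w_ctx)
after_zero_zero = run_srn([[0.0], [0.0]], w_in, w_ctx)
```

Both sequences end with the same input, yet the final hidden patterns differ, because the hidden state carries information about the earlier input.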
A linearly separable Boolean function is derived from a set of examples by a perceptron with optimal stability. The probability to reconstruct a pattern which is not learnt is calculated analytically using the replica method. [see also “double descent”]
The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks. These algorithms have (1) the advantage that they do not require a precisely defined training interval, operating while the network runs; and (2) the disadvantage that they require nonlocal communication in the network being trained and are computationally expensive. These algorithms allow networks having recurrent connections to learn complex tasks that require the retention of information over time periods having either fixed or indefinite length.
Most known learning algorithms for dynamic neural networks in non-stationary environments need global computations to perform credit assignment. These algorithms either are not local in time or not local in space. Those algorithms which are local in both time and space usually cannot deal sensibly with ‘hidden units’. In contrast, as far as we can judge, learning rules in biological systems with many ‘hidden units’ are local in both space and time. In this paper we propose a parallel on-line learning algorithm which performs local computations only, yet still is designed to deal with hidden units and with units whose past activations are ‘hidden in time’. The approach is inspired by Holland’s idea of the bucket brigade for classifier systems, which is transformed to run on a neural network with fixed topology. The result is a feedforward or recurrent ‘neural’ dissipative system which is consuming ‘weight-substance’ and permanently trying to distribute this substance onto its connections in an appropriate way. Simple experiments demonstrating the feasibility of the algorithm are reported.
The real-time recurrent learning algorithm (RTRL) is a gradient-following learning algorithm for completely recurrent networks running in continually sampled time. Here we use a series of simulation experiments to investigate the power and properties of this algorithm. In the recurrent networks studied here, any unit can be connected to any other, and any unit can receive external input. These networks run continually in the sense that they sample their inputs on every update cycle, and any unit can have a training target on any cycle. The storage required and computation time on each step are independent of time and are completely determined by the size of the network, so no prior knowledge of the temporal structure of the task being learned is required. The algorithm is nonlocal in the sense that each unit must have knowledge of the complete recurrent weight matrix and error vector. The algorithm is computationally intensive in sequential computers, requiring a storage capacity of the order of the third power of the number of units and a computation time on each cycle of the order of the fourth power of the number of units. The simulations include examples in which networks are taught tasks not possible with tapped delay lines—that is, tasks that require the preservation of state over potentially unbounded periods of time. The most complex example of this kind is learning to emulate a Turing machine that does a parenthesis balancing problem. Examples are also given of networks that do feedforward computations with unknown delays, requiring them to organize into networks with the correct number of layers. Finally, examples are given in which networks are trained to oscillate in various ways, including sinusoidal oscillation.
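The storage and computation costs quoted above can be seen in a minimal sketch of the sensitivity update. It assumes tanh units, a weight matrix `W` whose rows cover the external inputs followed by the unit outputs, and the hypothetical function name `rtrl_run`; the weight update itself is omitted.

```python
import math

def rtrl_run(xs, W, n):
    """Run a fully recurrent net of n units and carry the sensitivities forward.

    W[k] holds unit k's weights over z = external inputs followed by unit outputs.
    p[k][i][j] is the derivative of y_k with respect to W[i][j]; it has
    n * n * (m + n) entries, which is the third-power storage cost.
    """
    m = len(xs[0])
    y = [0.0] * n
    p = [[[0.0] * (m + n) for _ in range(n)] for _ in range(n)]
    for x in xs:
        z = list(x) + y
        new_y = [math.tanh(sum(W[k][l] * z[l] for l in range(m + n))) for k in range(n)]
        new_p = [[[0.0] * (m + n) for _ in range(n)] for _ in range(n)]
        for k in range(n):
            slope = 1.0 - new_y[k] ** 2
            for i in range(n):
                for j in range(m + n):
                    # Inner sum over n units for each entry: the fourth-power time cost.
                    rec = sum(W[k][m + l] * p[l][i][j] for l in range(n))
                    direct = z[j] if i == k else 0.0
                    new_p[k][i][j] = slope * (rec + direct)
        y, p = new_y, new_p
    return y, p
```

Given an error e_k on unit k at some cycle, the gradient for W[i][j] at that cycle would be the sum over k of e_k * p[k][i][j], which can be applied while the network keeps running.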
Error propagation networks are able to learn a variety of tasks in which a static input pattern is mapped onto a static output pattern. This paper presents a generalisation of these nets to deal with time varying, or dynamic, patterns. Three possible architectures are explored which deal with learning sequences of known finite length and sequences of unknown and possibly infinite length. Several examples are given and an application to speech coding is discussed.
A further development of dynamic nets is made which allows them to be trained by a signal which expresses the correctness of the output of the net, the utility signal. One possible architecture for such utility driven dynamic nets is given and a simple example is presented. Utility driven dynamic nets are potentially able to calculate and maximize any function of the input and output data streams, within the considered context. This is a very powerful property, and an appendix presents a comparison of the information processing in utility driven dynamic nets and that in the human brain.
[1987 retrospective by noted proponent of logic for planning and reasoning in AI (‘GOFAI’); McDermott criticizes his own work fiercely, along with that of his colleagues (particularly John McCarthy, Robert Moore, James Allen, Jerry Hobbs, & Patrick Hayes), describing the ‘logicist’ paradigm—that sufficiently ingenious and persistent application of logical reasoning, mostly first-order logic, can eventually give rise to human-level understanding of the world, planning & execution of actions, and eventually AGI.
McDermott concludes that the nature of such programs is that they are unable to see if they are making real progress (because a failure to infer something could simply reflect a lacking axiom), and worse, that such logics are not even an approximation to what intelligence is, or a role model, or that failures reflect poor choice of axioms, but that logics only verify things and do not compute useful things like plans, and collapse into verifying trivialities which do no useful intellectual work. Resorts to powerful tools like temporal logics or nonmonotonic logics sacrifice the philosophical advantages of logical inference in an attempt to get working systems, but may obtain neither. What is necessary is doing without deduction.]
It must be the case that a substantial portion of the inferences we want [to make] are deductions, or it will simply be irrelevant how many theorems follow deductively from a given axiom set.
…To summarize: The logicist project of expressing “naive physics” in first-order logic has not been very successful. One reason may be that the basic argument was flawed. You cannot write down axioms independent of a program for manipulating them if the inferences you are interested in are not deductions. Unfortunately, very few interesting inferences are deductions, and the attempts by logicists to extend logic to cover more territory have been disappointing. Hence we must resign ourselves to writing programs, and viewing knowledge representations as entities to be manipulated by the programs.
…Finally, I should admit that I am still doing work in the paradigm that I criticize here. In the domain of shape representation, so little is known that focusing on an idealization cannot but help teach us something. The problem I would like to tackle is representing the knowledge required to answer questions like, Could a paper clip be used as a key ring? The idealization I have been forced to fall back on is to prove that a paper clip of a certain size and shape could fit through the hole of a typical key. It should be obvious how much of the original problem this leaves out. Still, the territory is so unexplored that a tour through the idealized fragment could turn up something interesting. What one cannot hope for is to express as logical axioms everything there is to know about using shapes in unusual ways, before designing programs for this task. This will probably come as a shock to no one but me and a few friends.
We describe a new learning procedure, backpropagation, for networks of neuron-like units.
The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units.
The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.
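The weight-adjustment procedure described above can be sketched for a tiny network with one hidden layer. The logistic units, the omission of separate bias terms (a constant input of 1 stands in for them), the half-sum-of-squares error, and all function names are assumptions made for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w_hid, w_out):
    """Return the hidden activations and the single output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def error(data, w_hid, w_out):
    """Half the sum of squared differences between actual and desired outputs."""
    return 0.5 * sum((forward(x, w_hid, w_out)[1] - t) ** 2 for x, t in data)

def gradients(data, w_hid, w_out):
    """Propagate the output error backward to get the gradient for every weight."""
    g_hid = [[0.0] * len(row) for row in w_hid]
    g_out = [0.0] * len(w_out)
    for x, t in data:
        hidden, y = forward(x, w_hid, w_out)
        delta_out = (y - t) * y * (1.0 - y)
        for k, h in enumerate(hidden):
            g_out[k] += delta_out * h
            delta_hid = delta_out * w_out[k] * h * (1.0 - h)
            for j, xj in enumerate(x):
                g_hid[k][j] += delta_hid * xj
    return g_hid, g_out

def train_step(data, w_hid, w_out, lr):
    """One gradient-descent adjustment of all weights."""
    g_hid, g_out = gradients(data, w_hid, w_out)
    new_hid = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(w_hid, g_hid)]
    new_out = [w - lr * g for w, g in zip(w_out, g_out)]
    return new_hid, new_out
```

Repeating `train_step` many times is the "repeatedly adjusts the weights" part of the procedure; whether a given run reaches a good solution depends on the starting weights and the learning rate.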
A theory of serial order is proposed that attempts to deal both with the classical problem of the temporal organization of internally generated action sequences as well as with certain of the parallel aspects of sequential behavior.
The theory describes a dynamical system that is embodied as a parallel distributed processing or connectionist network. The trajectories of this dynamical system come to follow desired paths corresponding to particular action sequences as a result of a learning process during which constraints are imposed on the system. These constraints enforce sequentiality where necessary and, as they are relaxed, performance becomes more parallel.
The theory is applied to the problem of coarticulation in speech production and simulation experiments are presented.
This paper presents a generalization of the perceptron learning procedure for learning the correct sets of connections for arbitrary networks. The rule, called the “generalized delta rule”, is a simple scheme for implementing a gradient descent method for finding weights that minimize the sum squared error of the system’s performance. The major theoretical contribution of the work is the procedure called “error propagation”, whereby the gradient can be determined by individual units of the network based only on locally available information. The major empirical contribution of the work is to show that the problem of local minima is not serious in this application of gradient descent.
[Keywords: learning networks perceptrons adaptive systems learning machines, back propagation]
…In their pessimistic discussion of perceptrons, Minsky and Papert (1969) finally discuss multilayer machines near the end of their book. They state:
The perceptron has shown itself worthy of study despite (and even because of!) its severe limitations. It has many features that attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension is sterile. Perhaps some powerful convergence theorem will be discovered, or some profound reason for the failure to produce an interesting “learning theorem” for the multilayered machine will be found. (pp. 231–232)
Although our learning results do not guarantee that we can find a solution for all solvable problems, our analyses and results have shown that as a practical matter, the error propagation scheme leads to solutions in virtually every case. In short, we believe that we have answered Minsky and Papert’s challenge and have found a learning result sufficiently powerful to demonstrate that their pessimism about learning in multilayer machines was misplaced.
One way to view the procedure we have been describing is as a parallel computer that, having been shown the appropriate input/output exemplars specifying some function, programs itself to compute that function in general. Parallel computers are notoriously difficult to program. Here we have a mechanism whereby we do not actually have to know how to write the program in order to get the system to do it. Parker (1985) has emphasized this point. [Learning-logic: casting the cortex of the human brain in silicon, (TR-47). Cambridge, MA: Massachusetts Institute of Technology, Center for Computational Research in Economics and Management Science]
On many occasions we have been surprised to learn of new methods of computing interesting functions by observing the behavior of our learning algorithm. This also raised the question of generalization. In most of the cases presented above, we have presented the system with the entire set of exemplars. It is interesting to ask what would happen if we presented only a subset of the exemplars at training time and then watched the system generalize to remaining exemplars. In small problems such as those presented here, the system sometimes finds solutions to the problems which do not properly generalize. However, preliminary results on larger problems are very encouraging in this regard. This research is still in progress and cannot be reported here. This is currently a very active interest of ours.
Finally, we should say that this work is not yet in a finished form. We have only begun our study of recurrent networks and sigma-pi units. We have not yet applied our learning procedure to many very complex problems. However, the results to date are encouraging and we are continuing our work.
[Predicting programs designed for large general-purpose computers constitute an important new tool in the control of production and economics. Nevertheless, small predicting filters have their own domain of application. They can be realized not only as programs for general-purpose computers, but also as simple analog devices with very fast response. The authors discuss three principal methods of prediction, in addition to some others: (1) prediction of deterministic processes, i.e., extrapolation and interpolation; (2) prediction of stochastic processes, based on statistical prediction theory; and (3) prediction based on adaptation or learning of the predicting filters.]
The problem of optimal rocket flight in an inverse square law force field has been studied extensively by Lawden and Leitmann. Periods of zero thrust, intermediate thrust, and maximum thrust are possible subarcs of the solution according to analysis of the Euler-Lagrange equations and the Weierstrass necessary condition. Arcs of intermediate thrust have been examined recently by Lawden; however, the question of whether or not such arcs actually may furnish a minimum has been left unresolved. The present paper derives the singular extremals of Lawden’s problem by means of the Legendre-Clebsch necessary condition applied in a transformed system of state and control variables.
These are obtained as circular orbits along which the thrust is zero and intermediate thrust arcs are found in Lawden’s analysis. Since these solutions satisfy only the weak form of the Legendre-Clebsch condition, i.e., the extremals are singular in the transformed system of variables, the question of their minimality remains unanswered.
A systematic and rapid steepest-ascent numerical procedure is described for solving two-point boundary-value problems in the calculus of variations for systems governed by a set of nonlinear ordinary differential equations. Numerical examples are presented for minimum time-to-climb and maximum altitude paths for a supersonic interceptor and maximum-range paths for an orbital glider.
…A systematic and rapid steepest-ascent numerical procedure is described for determining optimum programs for nonlinear systems with terminal constraints. The procedure uses the concept of local linearization around a nominal (non-optimum) path. The effect on the terminal conditions of a small change in the control variable program is determined by numerical integration of the adjoint differential equations for small perturbations about the nominal path. Having these adjoint (or influence) functions, it is then possible to determine the change in the control variable program that gives maximum increase in the pay-off function for a given mean-square perturbation of the control variable program while simultaneously changing the terminal quantities by desired amounts. By repeating this process in small steps, a control variable program that minimizes one quantity and yields specified values of other terminal quantities can be approached as closely as desired. Three numerical examples are presented: (a) The angle-of-attack program for a typical supersonic interceptor to climb to altitude in minimum time is determined with and without specified terminal velocity and heading. (b) The angle-of-attack program for the same interceptor to climb to maximum altitude is determined. (c) The angle-of-attack program is determined for a hypersonic orbital glider to obtain maximum surface range starting from satellite speed at 300,000 ft altitude.
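The role of the adjoint (influence) functions can be illustrated with a discrete-time toy problem. The scalar system dx/dt = −x³ + u, the Euler discretization, the terminal pay-off, and all function names are assumptions chosen for brevity; the terminal constraints handled by the paper's procedure are left out of this sketch.

```python
def simulate(x0, controls, dt):
    """Euler-integrate the assumed toy system dx/dt = -x**3 + u along a control program."""
    xs = [x0]
    for u in controls:
        xs.append(xs[-1] + dt * (-xs[-1] ** 3 + u))
    return xs

def payoff(x_final):
    """Assumed terminal pay-off: largest when the final state equals 1."""
    return -(x_final - 1.0) ** 2

def influence_gradient(x0, controls, dt):
    """Gradient of the pay-off with respect to every control value.

    The influence variable is integrated backward along the nominal path.
    """
    xs = simulate(x0, controls, dt)
    lam = -2.0 * (xs[-1] - 1.0)                      # d payoff / d x_final
    grad = [0.0] * len(controls)
    for k in reversed(range(len(controls))):
        grad[k] = lam * dt                           # effect of u_k on x_{k+1}
        lam = lam * (1.0 - dt * 3.0 * xs[k] ** 2)    # carry influence back to x_k
    return grad

def steepest_ascent(x0, controls, dt, step, iterations):
    """Repeatedly perturb the control program in the direction of the gradient."""
    for _ in range(iterations):
        grad = influence_gradient(x0, controls, dt)
        controls = [u + step * g for u, g in zip(controls, grad)]
    return controls

nominal = [0.0] * 20
improved = steepest_ascent(0.0, nominal, 0.05, 1.0, 50)
```

One backward pass gives the sensitivity of the pay-off to the whole control program, which is why the cost per iteration is comparable to a single simulation of the path.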
The method of gradients, also known as the method of steepest descent, is an elementary concept for the solution of minimum problems. In recent years the computational appeal of the method has led to its adoption in a variety of applications such as multivariable minimum problems of ordinary calculus, solution of systems of algebraic equations, integral equations, and variational problems. This chapter begins with a discussion of the main features of the gradient method in the context of ordinary minimum problems subject to constraints. It also discusses the variational problems of flight performance, introducing Green’s functions in the role played by partial derivatives in ordinary minimum problems and attempting to preserve an analogy between the two classes of problems in the subsequent development. The close relationship between Green’s functions or influence functions and the error coefficients of guidance theory has drawn attention to the usefulness of the adjoint system technique in guidance analysis.