newsletter/2021/01 (Link Bibliography)

“newsletter/​2021/​01” links:

  1. 01

  2. https://gwern.substack.com

  3. 12

  4. newsletter

  5. Changelog

  6. https://www.patreon.com/gwern

  7. Danbooru2020

  8. “This Anime Does Not Exist”, Nearcyan, Aydao, Shawn Presser, Gwern Branwen (Tensorfork) (2021-01-19):

    [Website demonstrating samples from a modified StyleGAN2 trained on Danbooru2019 using TPUs for ~5m iterations over ~2 months on a TPU pod; this modified ‘StyleGAN2-ext’ removes various regularizations which make StyleGAN2 data-efficient on datasets like FFHQ but hobble its ability to model complicated images, and scales the model up >2×. This is surprisingly effective given StyleGAN’s previous inability to approach BigGAN’s Danbooru2019 results, and TADNE shows off the entertaining results.

    The interface reuses Said Achmiz’s These Waifus Do Not Exist grid UI.

    Writeup; see also: Colab notebook to search by CLIP embedding; TADNE face editing, guided ponies]

    Screenshot of “This Anime Does Not Exist” infinite-scroll website.
  9. Faces#extended-stylegan2-danbooru2019-aydao

  10. /r/MLScaling subreddit, Gwern Branwen (2020-10-30):

    Subreddit for discussing AI, machine learning, or deep learning approaches involving big numbers: billions of parameters, millions of n, petaflops, etc (eg GPT-3). Most research is conducted at much smaller scale; this subreddit is for research analogous to ‘high energy physics’, requiring specialized approaches, large investments, consortia, etc.

    Topics: How? Who? Why do they work? What are they good for? What resources are available? Who will pay & how? What is the future of such approaches? What global consequences will there be?

  11. “DALL·E: Creating Images from Text”, Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Mark Chen, Rewon Child, Vedant Misra, Pamela Mishkin, Gretchen Krueger, Sandhini Agarwal, Ilya Sutskever (2021-01-05):

    [Paper: “Zero-Shot Text-to-Image Generation”, Ramesh et al 2021. Re-implementation: DALL·E Mini (writeup). Availability through OA API still planned as of 2021-09-05.] DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs. We’ve found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.

    GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. [iGPT is another answer to the question of “how do we do images autoregressively, but not at the exorbitant cost of generating pixels 1 by 1?”; iGPT uses ‘super pixels’ & very small images, while DALL·E uses VAE ‘tokens’ corresponding roughly to small squares, so the token sequence is relatively small, with the VAE doing the actual compilation to raw pixels.] We extend these findings to show that manipulating visual concepts through language is now within reach.

    [3 DALL·E prompts: “an armchair in the shape of an avocado…” · “a store front that has the word ‘openai’ written on it…” · “the exact same cat on the top as a sketch on the bottom”]

    DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192. The images are preprocessed to 256×256 resolution during training. Similar to VQ-VAE, each image is compressed to a 32×32 grid of discrete codes using a discrete VAE that we pretrained using a continuous relaxation. We found that training using the relaxation obviates the need for an explicit codebook, EMA loss, or tricks like dead code revival, and can scale up to large vocabulary sizes.
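
    As a concrete illustration of the token budget just described, a back-of-the-envelope sketch in Python (the constants come from the text above; variable names are mine):

    ```python
    # Token budget for one (caption, image) training example, per the description above.
    TEXT_VOCAB_SIZE  = 16_384   # BPE vocabulary for captions
    IMAGE_VOCAB_SIZE = 8_192    # discrete-VAE codebook size
    MAX_TEXT_TOKENS  = 256      # BPE tokens per caption
    IMAGE_GRID_SIDE  = 32       # 32x32 grid of VAE codes per 256x256 image

    image_tokens = IMAGE_GRID_SIDE ** 2                # 1,024 image tokens
    sequence_length = MAX_TEXT_TOKENS + image_tokens   # 1,280 tokens for the autoregressive transformer
    pixels_per_token = (256 * 256) / image_tokens      # each code stands in for an 8x8 pixel patch
    print(sequence_length, pixels_per_token)           # 1280 64.0
    ```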

    Capabilities: We find that DALL·E is able to create plausible images for a great variety of sentences that explore the compositional structure of language. We illustrate this using a series of interactive visuals in the next section. The samples shown for each caption in the visuals are obtained by taking the top 32 of 512 after reranking with CLIP, but we do not use any manual cherry-picking, aside from the thumbnails and standalone images that appear outside.

    1. Controlling attributes: We test DALL·E’s ability to modify several of an object’s attributes, as well as the number of times that it appears.

    2. Drawing multiple objects

    3. Visualizing perspective and three-dimensionality

    4. Visualizing internal and external structure

    5. Inferring contextual details

      We find that DALL·E is able to render the same scene in a variety of different styles, and can adapt the lighting, shadows, and environment based on the time of day or season: “a … of a capybara sitting in a field at sunrise”

    …With varying degrees of reliability, DALL·E provides access to a subset of the capabilities of a 3D rendering engine via natural language. It can independently control the attributes of a small number of objects, and to a limited extent, how many there are, and how they are arranged with respect to one another. It can also control the location and angle from which a scene is rendered, and can generate known objects in compliance with precise specifications of angle and lighting conditions.

    Zero-shot visual reasoning: GPT-3 can be instructed to perform many kinds of tasks solely from a description and a cue to generate the answer supplied in its prompt, without any additional training. For example, when prompted with the phrase “here is the sentence ‘a person walking his dog in the park’ translated into French:”, GPT-3 answers “un homme qui promène son chien dans le parc.” This capability is called zero-shot reasoning. We find that DALL·E extends this capability to the visual domain, and is able to perform several kinds of image-to-image translation tasks when prompted in the right way. [See also CLIP.]

    We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it. Motivated by these results, we measure DALL·E’s aptitude for analogical reasoning problems by testing it on Raven’s Progressive Matrices, a visual IQ test that saw widespread use in the 20th century. Rather than treating the IQ test as a multiple-choice problem as originally intended, we ask DALL·E to complete the bottom-right corner of each image using argmax sampling, and consider its completion to be correct if it is a close visual match to the original. DALL·E is often able to solve matrices that involve continuing simple patterns or basic geometric reasoning, such as those in sets B and C. It is sometimes able to solve matrices that involve recognizing permutations and applying boolean operations, such as those in set D. The instances in set E tend to be the most difficult, and DALL·E gets almost none of them correct. For each of the sets, we measure DALL·E’s performance on both the original images, and the images with the colors inverted. The inversion of colors should pose no additional difficulty for a human, yet does generally impair DALL·E’s performance, suggesting its capabilities may be brittle in unexpected ways.

  12. “Generating Diverse High-Fidelity Images with VQ-VAE-2”, Ali Razavi, Aaron van den Oord, Oriol Vinyals (2019-06-02):

    We explore the use of Vector Quantized Variational AutoEncoder (VQ-VAE) models for large scale image generation. To this end, we scale and enhance the autoregressive priors used in VQ-VAE to generate synthetic samples of much higher coherence and fidelity than possible before. We use simple feed-forward encoder and decoder networks, making our model an attractive candidate for applications where the encoding and/or decoding speed is critical. Additionally, VQ-VAE requires sampling an autoregressive model only in the compressed latent space, which is an order of magnitude faster than sampling in the pixel space, especially for large images. We demonstrate that a multi-scale hierarchical organization of VQ-VAE, augmented with powerful priors over the latent codes, is able to generate samples with quality that rivals that of state of the art Generative Adversarial Networks on multifaceted datasets such as ImageNet, while not suffering from GAN’s known shortcomings such as mode collapse and lack of diversity.

  13. “CLIP: Connecting Text and Images”, Alec Radford, Ilya Sutskever, Jong Wook Kim, Gretchen Krueger, Sandhini Agarwal (2021-01-05):

    We present a neural network that aims to address these problems: it is trained on a wide variety of images with a wide variety of natural language supervision that’s abundantly available on the internet. By design, the network can be instructed in natural language to perform a great variety of classification benchmarks, without directly optimizing for the benchmark’s performance, similar to the “zero-shot” capabilities of GPT-2 and GPT-3. This is a key change: by not directly optimizing for the benchmark, we show that it becomes much more representative: our system closes this “robustness gap” by up to 75% while matching the performance of the original ResNet-50 on ImageNet zero-shot without using any of the original 1.28M labeled examples.

    Approach: We show that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a great variety of image classification datasets. Our method uses an abundantly available source of supervision: the text paired with images found across the internet. This data is used to create the following proxy training task for CLIP: given an image, predict which out of a set of 32,768 randomly sampled text snippets was actually paired with it in our dataset.

    In order to solve this task, our intuition is that CLIP models will need to learn to recognize a wide variety of visual concepts in images and associate them with their names. As a result, CLIP models can then be applied to nearly arbitrary visual classification tasks. For instance, if the task of a dataset is classifying photos of dogs vs cats we check for each image whether a CLIP model predicts the text description “a photo of a dog” or “a photo of a cat” is more likely to be paired with it.
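
    A minimal sketch of that zero-shot recipe, assuming hypothetical `encode_image`/`encode_text` functions that return unit-length embedding vectors (illustrative logic only, not the released model or API):

    ```python
    import numpy as np

    def zero_shot_classify(image_vec, class_names, encode_text):
        """Return the class whose caption embedding best matches the image embedding."""
        prompts = [f"a photo of a {name}" for name in class_names]   # simple prompt template
        text_vecs = np.stack([encode_text(p) for p in prompts])      # (num_classes, dim), unit-norm
        scores = text_vecs @ image_vec                                # cosine similarities
        return class_names[int(np.argmax(scores))]

    # Usage sketch: zero_shot_classify(encode_image(img), ["dog", "cat"], encode_text)
    ```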

    1. CLIP is highly efficient…In the end, our best performing CLIP model trains on 256 GPUs for 2 weeks, which is similar to existing large-scale image models.
    2. CLIP is flexible and general: Because they learn a wide range of visual concepts directly from natural language, CLIP models are substantially more flexible and general than existing ImageNet models. We find they are able to zero-shot perform many different tasks. To validate this we have measured CLIP’s zero-shot performance on over 30 different datasets including tasks such as fine-grained object classification, geo-localization, action recognition in videos, and OCR. [While CLIP’s zero-shot OCR performance is mixed, its semantic OCR representation is quite useful. When evaluated on the SST-2 NLP dataset rendered as images, a linear classifier on CLIP’s representation matches a CBoW model with direct access to the text. CLIP is also competitive at detecting hateful memes without needing ground truth text.] In particular, learning OCR is an example of an exciting behavior that does not occur in standard ImageNet models.

    …CLIP allows people to design their own classifiers and removes the need for task-specific training data. [See also Guzhov et al 2021; CLIP notebook compilation for art; “Alien Dreams: An Emerging Art Scene”/“AI Generated Art Scene Explodes as Hackers Create Groundbreaking New Tools”.]

  14. “Learning Transferable Visual Models From Natural Language Supervision”, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever (2021-01-05):

    State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision.

    We demonstrate that the simple pre-training [contrastive learning] task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification.

    The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.
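
    A sketch of the contrastive pre-training objective (a PyTorch paraphrase of the pseudocode in the paper; batch size, feature dimension, and temperature are placeholders):

    ```python
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
        """Symmetric cross-entropy over the N×N similarity matrix of a batch of (image, text) pairs."""
        image_feats = F.normalize(image_feats, dim=-1)        # (N, D)
        text_feats = F.normalize(text_feats, dim=-1)          # (N, D)
        logits = image_feats @ text_feats.t() / temperature   # (N, N); true pairs on the diagonal
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) +            # image -> matching text
                F.cross_entropy(logits.t(), targets)) / 2     # text  -> matching image
    ```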

    Figure 4: Prompt engineering and ensembling improve zero-shot performance. Compared to the baseline of using contextless class names, prompt engineering and ensembling boost zero-shot classification performance by almost 5 points on average across 36 datasets. This improvement is similar to the gain from using 4× more compute with the baseline zero-shot method but is “free” when amortized over many predictions.
    Figure 5: Zero-shot CLIP is competitive with a fully supervised baseline. Across a 27 dataset eval suite, a zero-shot CLIP classifier outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 datasets, including ImageNet.
    Figure 9: Zero-shot CLIP performance scales smoothly as a function of model compute. Across 39 evals on 36 different datasets, average zero-shot error is well modeled by a log-log linear trend across a 44× range of compute spanning 5 different CLIP models. Lightly shaded lines are performance on individual evals, showing that performance is much more varied despite the smooth overall trend.
    Figure 13: Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models. (Left) An ideal robust model (dashed line) performs equally well on the ImageNet distribution and on other natural image distributions. Zero-shot CLIP models shrink this “robustness gap” by up to 75%. Linear fits on logit transformed values are shown with bootstrap estimated 95% confidence intervals. (Right) Visualizing distribution shift for bananas, a class shared across 5 of the 7 natural distribution shift datasets. The performance of the best zero-shot CLIP model, ViT-L/​​​​14@336px, is compared with a model that has the same performance on the ImageNet validation set, ResNet-101.
    Figure 21: Visualization of predictions from 36 CLIP zero-shot classifiers. All examples are random with the exception of reselecting Hateful Memes to avoid offensive content. The predicted probability of the top 5 classes is shown along with the text used to represent the class. When more than one template is used, the first template is shown. The ground truth label is colored green while an incorrect prediction is colored orange.

    [Evaluations: Food101 · Sun398 · Youtube-BB · EuroSAT · PatchCamelyon (PCam) · ImageNet-A (Adversarial) · CIFAR-10 · CLEVR Count · Facial Emotion Recognition 2013 (FER2013) · UCF101 · Caltech-101 · ImageNet-R (Rendition) · Oxford-IIIT Pets · CIFAR-100 · ImageNetV2 Matched Frequency · FGVC Aircraft · Country211 · RESISC45 · Stanford Cars · SUN · Kinetics-700 · Flowers-102 · ImageNet · Birdsnap · aYahoo · ObjectNet ImageNet Overlap · ImageNet Blurry · Describable Textures Dataset (DTD) · PASCAL VOC 2007 · MNIST · Street View House Numbers (SVHN) · ImageNet Vid · ImageNet Sketch · Hateful Memes · Stanford Sentiment Treebank · German Traffic Sign Recognition Benchmark (GTSRB)]

  15. Scaling-hypothesis#blessings-of-scale

  16. “Contrastive Representation Learning: A Framework and Review”, Phuc H. Le-Khac, Graham Healy, Alan F. Smeaton (2020-10-10):

    Contrastive Learning has recently received interest due to its success in self-supervised representation learning in the computer vision domain. However, the origins of Contrastive Learning date as far back as the 1990s and its development has spanned across many fields and domains including Metric Learning and natural language processing. In this paper we provide a comprehensive literature review and we propose a general Contrastive Representation Learning framework that simplifies and unifies many different contrastive learning methods. We also provide a taxonomy for each of the components of contrastive learning in order to summarise it and distinguish it from other forms of machine learning. We then discuss the inductive biases which are present in any contrastive learning system and we analyse our framework under different views from various sub-fields of Machine Learning. Examples of how contrastive learning has been applied in computer vision, natural language processing, audio processing, and others, as well as in Reinforcement Learning are also presented. Finally, we discuss the challenges and some of the most promising future research directions ahead.

  17. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”, Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby (2020-09-28):

    One-sentence Summary: Transformers applied directly to image patches and pre-trained on large datasets work really well on image classification.

    While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data [JFT-300M] and transferred to multiple recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc), Vision Transformer attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train…Our Vision Transformer, pre-trained on the JFT-300M dataset, approaches or beats state of the art on multiple image recognition benchmarks, reaching accuracy of 88.36% on ImageNet, 90.77% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.16% on the VTAB suite of 19 tasks…Interestingly, our models took substantially less compute to pre-train than state of the art, however, we note that pre-training efficiency may be affected not only by the architecture choice, but also other parameters, such as training schedule, optimizer, weight decay, etc. We provide a controlled study of performance vs. compute for different architectures in Section 4.4…Finally, [we plan] to further scale ViT, given that the performance does not seem yet to be saturating with the increased model size.
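
    A sketch of the patch-tokenization step that lets a standard Transformer consume images (shapes are illustrative; the real model also prepends a class token and adds position embeddings):

    ```python
    import torch

    def patchify(images, patch_size=16):
        """Split (B, C, H, W) images into a sequence of flattened patches: (B, N, C*P*P)."""
        B, C, H, W = images.shape
        p = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
        return p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)

    x = torch.randn(2, 3, 224, 224)
    embed = torch.nn.Linear(3 * 16 * 16, 768)   # linear projection of flattened patches
    tokens = embed(patchify(x))                 # (2, 196, 768): 196 "visual words" per image
    ```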

    [Keywords: computer vision, image recognition, self-attention, transformer, large-scale training]

    [Blog.]

  18. “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, Andrew Brock, Jeff Donahue, Karen Simonyan (2018-09-28):

    Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal. To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale. We find that applying orthogonal regularization to the generator renders it amenable to a simple “truncation trick,” allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the Generator’s input. Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128×128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 and Fréchet Inception Distance (FID) of 7.4, improving over the previous best IS of 52.52 and FID of 18.6.
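
    A sketch of the “truncation trick” in isolation (illustrative NumPy, not the authors’ code): at sampling time, latent entries beyond a threshold are resampled, trading sample variety for fidelity.

    ```python
    import numpy as np

    def truncated_latents(batch_size, z_dim, threshold=0.5, seed=0):
        """Sample z ~ N(0, I), resampling any entry with |z| > threshold."""
        rng = np.random.default_rng(seed)
        z = rng.standard_normal((batch_size, z_dim))
        mask = np.abs(z) > threshold
        while mask.any():                        # smaller threshold => higher fidelity, less variety
            z[mask] = rng.standard_normal(mask.sum())
            mask = np.abs(z) > threshold
        return z                                 # fed to the generator in place of untruncated noise
    ```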

  19. https://nitter.hu/quasimondo/status/1351191660059832320

  20. “Deep Residual Learning for Image Recognition”, Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015-12-10):

    Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

    The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

  21. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (2018-10-11):

    We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

    BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
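
    A sketch of what “fine-tuned with just one additional output layer” means in practice (PyTorch; `encoder` is an assumed pretrained BERT body returning per-token hidden states, not tied to any particular implementation):

    ```python
    import torch.nn as nn

    class BertForClassification(nn.Module):
        """Pretrained bidirectional encoder plus a single task-specific output layer."""
        def __init__(self, encoder, hidden_size=768, num_labels=2):
            super().__init__()
            self.encoder = encoder                                # fine-tuned end-to-end, not frozen
            self.classifier = nn.Linear(hidden_size, num_labels)  # the one additional layer

        def forward(self, input_ids, attention_mask):
            hidden = self.encoder(input_ids, attention_mask)      # (B, T, H); assumed signature
            cls_vector = hidden[:, 0]                             # representation of the [CLS] token
            return self.classifier(cls_vector)                    # task logits (eg. sentiment, NLI)
    ```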

  22. “Language Models are Few-Shot Learners”, Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (2020-05-28):

    Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions—something which current NLP systems still largely struggle to do.

    Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
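
    A sketch of what “tasks and few-shot demonstrations specified purely via text interaction” looks like, using a hypothetical translation prompt (format illustrative):

    ```python
    # Few-shot prompting: the task description and K demonstrations live entirely in the
    # prompt text; the model's weights are never updated.
    demonstrations = [("cheese", "fromage"), ("dog", "chien"), ("house", "maison")]
    query = "park"

    prompt = "Translate English to French.\n"
    for english, french in demonstrations:
        prompt += f"English: {english}\nFrench: {french}\n"
    prompt += f"English: {query}\nFrench:"   # the language model completes this final line

    print(prompt)
    ```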

    Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

    …The precise architectural parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models across GPU’s. Previous work suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range.

  23. 1997-hochreiter.pdf: “Long Short-Term Memory”, Sepp Hochreiter, Jürgen Schmidhuber (1997-12-15; ai):

    Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter’s 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM).

    Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is 𝒪(1).
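
    In now-standard notation (modernized; the forget gate was only added to LSTM after this 1997 paper), the gating around the constant error carousel is:

    $$
    \begin{aligned}
    i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
    o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
    c_t &= c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(constant error carousel: self-connection of weight 1)} \\
    h_t &= o_t \odot \tanh(c_t)
    \end{aligned}
    $$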

    Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

  24. “ImageNet Classification with Deep Convolutional Neural Networks”, Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (2012-12):

    We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

  25. “Attention Is All You Need”, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017-06-12):

    The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
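
    For reference, the scaled dot-product attention at the core of the architecture, as defined in the paper:

    $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$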

  26. “Better Language Models and Their Implications”, Alec Radford, Jeffrey Wu, Dario Amodei, Daniela Amodei, Jack Clark, Miles Brundage, Ilya Sutskever (OpenAI) (2019-02-14):

    Our model, called GPT-2 (a successor to GPT), was trained simply to predict the next word in 40GB of Internet text. Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with, as well as a technical paper.

    GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10× the parameters and trained on more than 10× the amount of data.

    GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.

  27. “A Simple Framework for Contrastive Learning of Visual Representations”, Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton (2020-02-13):

    This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels.

  28. “When will computer hardware match the human brain?”, Hans Moravec (1998):

    This paper describes how the performance of AI machines tends to improve at the same pace that AI researchers get access to faster hardware. The processing power and memory capacity necessary to match general intellectual performance of the human brain are estimated. Based on extrapolation of past trends and on examination of technologies under development, it is predicted that the required hardware will be available in cheap machines in the 2020s…At the present rate, computers suitable for human-like robots will appear in the 2020s. Can the pace be sustained for another three decades?

    …By 1990, entire careers had passed in the frozen winter of 1-MIPS computers, mainly from necessity, but partly from habit and a lingering opinion that the early machines really should have been powerful enough. In 1990, 1 MIPS cost $1,000 (1990 dollars; ~$2,338 inflation-adjusted) in a low-end personal computer. There was no need to go any lower. Finally spring thaw has come. Since 1990, the power available to individual AI and robotics programs has doubled yearly, to 30 MIPS by 1994 and 500 MIPS by 1998. Seeds long ago alleged barren are suddenly sprouting. Machines read text, recognize speech, even translate languages. Robots drive cross-country, crawl across Mars, and trundle down office corridors. In 1996 a theorem-proving program called EQP running five weeks on a 50 MIPS computer at Argonne National Laboratory found a proof of a boolean algebra conjecture by Herbert Robbins that had eluded mathematicians for sixty years. And it is still only spring. Wait until summer.

    …The mental steps underlying good human chess playing and theorem proving are complex and hidden, putting a mechanical interpretation out of reach. Those who can follow the play naturally describe it instead in mentalistic language, using terms like strategy, understanding and creativity. When a machine manages to be simultaneously meaningful and surprising in the same rich way, it too compels a mentalistic interpretation. Of course, somewhere behind the scenes, there are programmers who, in principle, have a mechanical interpretation. But even for them, that interpretation loses its grip as the working program fills its memory with details too voluminous for them to grasp.

    As the rising flood reaches more populated heights, machines will begin to do well in areas a greater number can appreciate. The visceral sense of a thinking presence in machinery will become increasingly widespread. When the highest peaks are covered, there will be machines that can interact as intelligently as any human on any subject. The presence of minds in machines will then become self-evident.

    Faster than Exponential Growth in Computing Power: The number of MIPS in $1,000 (1998 dollars; ~$1,854 inflation-adjusted) of computer from 1900 to the present. Steady improvements in mechanical and electromechanical calculators before World War II had increased the speed of calculation a thousandfold over manual methods from 1900 to 1940. The pace quickened with the appearance of electronic computers during the war, and 1940 to 1980 saw a million-fold increase. The pace has been even quicker since then, a pace which would make human-like robots possible before the middle of the next century. The vertical scale is logarithmic, the major divisions represent thousandfold increases in computer performance. Exponential growth would show as a straight line, the upward curve indicates faster than exponential growth, or, equivalently, an accelerating rate of innovation. The reduced spread of the data in the 1990s is probably the result of intensified competition: underperforming machines are more rapidly squeezed out. The numerical data for this power curve are presented in the appendix.
    The big freeze: From 1960 to 1990 the cost of computers used in AI research declined, as their numbers dilution absorbed computer-efficiency gains during the period, and the power available to individual AI programs remained almost unchanged at 1 MIPS, barely insect power. AI computer cost bottomed in 1990, and since then power has doubled yearly, to several hundred MIPS by 1998. The major visible exception is computer chess (shown by a progression of knights), whose prestige lured the resources of major computer companies and the talents of programmers and machine designers. Exceptions also exist in less public competitions, like petroleum exploration and intelligence gathering, whose high return on investment gave them regular access to the largest computers.

  29. “Meta Pseudo Labels”, Hieu Pham, Zihang Dai, Qizhe Xie, Minh-Thang Luong, Quoc V. Le (2021-01-05):

    We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet [using JFT-300M], which is 1.6% better than the existing state-of-the-art. Like Pseudo Labels, Meta Pseudo Labels has a teacher network to generate pseudo labels on unlabeled data to teach a student network. However, unlike Pseudo Labels where the teacher is fixed, the teacher in Meta Pseudo Labels is constantly adapted by the feedback of the student’s performance on the labeled dataset. As a result, the teacher generates better pseudo labels to teach the student. Our code will be available at this URL.

  30. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”, William Fedus, Barret Zoph, Noam Shazeer (2021-01-11):

    In deep learning, models typically reuse the same parameters for all inputs. Mixture-of-Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model—with outrageous numbers of parameters—but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability—we address these with the Switch Transformer.

    We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large (Raffel et al 2019) to obtain up to 7× increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to 1-trillion parameter models on the “Colossal Clean Crawled Corpus” and achieve a 4× speedup over the T5-XXL model.
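
    A sketch of the simplified top-1 (“switch”) routing described above (illustrative PyTorch; the real implementation is batched, capacity-limited, and sharded across devices):

    ```python
    import torch
    import torch.nn.functional as F

    def switch_layer(x, router, experts):
        """Send each token to exactly one expert (top-1 routing), scaled by its router probability."""
        probs = F.softmax(router(x), dim=-1)        # (num_tokens, num_experts)
        gate, expert_index = probs.max(dim=-1)      # top-1 gate value and chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(experts):        # only the chosen expert's parameters are touched
            chosen = expert_index == i
            if chosen.any():
                out[chosen] = gate[chosen].unsqueeze(1) * expert(x[chosen])
        return out

    # eg. router = torch.nn.Linear(512, 8); experts = [torch.nn.Linear(512, 512) for _ in range(8)]
    ```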

    Figure 1: Scaling and sample efficiency of Switch Transformers. Left Plot: Scaling properties for increasingly sparse (more experts) Switch Transformers. Right Plot: Negative log-perplexity.

    Appendix E: Relation of Upstream to Downstream Model Performance

    There is no guarantee that a model’s quality on a pre-training objective will translate to downstream task results. Figure 13 presents the correlation of the upstream model quality, for both dense and Switch models, on the C4 pre-training task with two downstream task measures: average Super-GLUE performance and TriviaQA score. We choose these two tasks as one probes the model’s reasoning and the other factual knowledge.

    Figure 13: Upstream pre-trained quality to downstream model quality. We correlate the upstream performance with downstream quality on both SuperGLUE and TriviaQA (SOTA recorded without SSM), reasoning and knowledge-heavy benchmarks, respectively (validation sets). We find that, as with the baseline, the Switch model scales with improvements in the upstream pre-training task. For SuperGLUE, we find a loosely linear relation between negative log perplexity and the average SuperGLUE score. However, the dense model often performs better for a fixed perplexity, particularly in the large-scale regime. Conversely, on the knowledge-heavy task, TriviaQA, we find that the Switch Transformer may follow an improved scaling relationship—for a given upstream perplexity, it does better than a dense counterpart. Further statistics (expensive to collect and left to future work) would be necessary to confirm these observations.

    We find a consistent correlation, indicating that for both baseline and Switch models, improved pre-training leads to better downstream results. Additionally, for a fixed upstream perplexity we find that both Switch and dense models perform similarly in the small to medium model size regime. However, in the largest model regime (T5-11B/​​​​T5-XXL) our largest Switch models, as mentioned in Section 5.6⁠, do not always translate their upstream perplexity well to downstream fine-tuning on the SuperGLUE task. This warrants future investigation and study to fully realize the potential of sparse models. Understanding the fine-tuning dynamics with expert-models is very complicated and is dependent on regularization, load-balancing, and fine-tuning hyper-parameters.

  31. “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding”, Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen (2020-06-30):

    Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

  32. “Training data-efficient image transformers & distillation through attention”, Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou (2020-12-23):

    Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption.

    In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data.

    More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.

    [See also d’Ascoli et al 2021; Zhou et al 2021; also of interest: Neyshabur 2020; d’Ascoli et al 2019; Anandkumar et al 2016]

  33. “Bottleneck Transformers for Visual Recognition”, Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani (2021-01-27):

    We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and no other changes, our approach improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Through the design of BoTNet, we also point out how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance Segmentation benchmark using the Mask R-CNN framework; surpassing the previous best published single model and single scale results of ResNeSt evaluated on the COCO validation set. Finally, we present a simple adaptation of the BoTNet design for image classification, resulting in models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark while being up to 2.33× faster in compute time than the popular EfficientNet models on TPU-v3 hardware. We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.

  34. “Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet”, Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, Shuicheng Yan (2021-01-28):

    Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, eg. the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance compared with CNNs when trained from scratch on a midsize dataset (eg. ImageNet). We find it is because: (1) the simple tokenization of input images fails to model the important local structure (eg. edges, lines) among neighboring pixels, leading to its low training sample efficiency; (2) the redundant attention backbone design of ViT leads to limited feature richness in fixed computation budgets and limited training samples.

    To overcome such limitations, we propose a new Tokens-to-Token Vision Transformer (T2T-ViT), which introduces: (1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure presented by surrounding tokens can be modeled and tokens length can be reduced; (2) an efficient backbone with a deep-narrow structure for vision transformers motivated by CNN architecture design after extensive study.

    Notably, T2T-ViT reduces the parameter counts and MACs of vanilla ViT by 200%, while achieving more than 2.5% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets when directly training on ImageNet. For example, T2T-ViT with ResNet50-comparable-size can achieve 80.7% top-1 accuracy on ImageNet. (Code)

  35. “not-so-BigGAN: Generating High-Fidelity Images on Small Compute with Wavelet-based Super-Resolution”, Seungwook Han, Akash Srivastava, Cole Hurwitz, Prasanna Sattigeri, David D. Cox (2020-09-09):

    State-of-the-art models for high-resolution image generation, such as BigGAN and VQVAE-2, require an incredible amount of compute resources and/or time (512 TPU-v3 cores) to train, putting them out of reach for the larger research community. On the other hand, GAN-based image super-resolution models, such as ESRGAN, can not only upscale images to high dimensions, but also are efficient to train. In this paper, we present not-so-big-GAN (nsb-GAN), a simple yet cost-effective two-step training framework for deep generative models (DGMs) of high-dimensional natural images. First, we generate images in low-frequency bands by training a sampler in the wavelet domain. Then, we super-resolve these images from the wavelet domain back to the pixel-space with our novel wavelet super-resolution decoder network. Wavelet-based down-sampling method preserves more structural information than pixel-based methods, leading to significantly better generative quality of the low-resolution sampler (e.g., 64×64). Since the sampler and decoder can be trained in parallel and operate on much lower dimensional spaces than end-to-end models, the training cost is substantially reduced. On ImageNet 512×512, our model achieves a Fréchet Inception Distance (FID) of 10.59—beating the baseline BigGAN model—at half the compute (256 TPU-v3 cores).

  36. “Taming Transformers for High-Resolution Image Synthesis”, Patrick Esser, Robin Rombach, Björn Ommer (2020-12-17):

    Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (1) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (2) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers. [Github]

    TL;DR: We combine the efficiency of convolutional approaches with the expressivity of transformers by introducing a convolutional VQGAN, which learns a codebook of context-rich visual parts, whose composition is modeled with an autoregressive transformer.
  37. 2019-lecun-isscctalk-cake.png

  38. “ZeRO-Offload: Democratizing Billion-Scale Model Training”, Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He (2021-01-18):

    Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters.

    ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10× increase in size compared to popular frameworks such as PyTorch, and it does so without requiring any model change from the data scientists or sacrificing computational efficiency.

    ZeRO-Offload enables large model training by offloading data and compute to CPU. To preserve compute efficiency, it is designed to minimize the data movement to/from GPU, and reduce CPU compute time while maximizing memory savings on GPU. As a result, ZeRO-Offload can achieve 40 TFLOPS/GPU on a single NVIDIA V100 GPU for a 10B parameter model, compared to 30 TFLOPS using PyTorch alone for a 1.4B parameter model, the largest that can be trained without running out of memory. ZeRO-Offload is also designed to scale on multiple-GPUs when available, offering near linear speedup on up to 128 GPUs. Additionally, it can work together with model parallelism to train models with over 70 billion parameters on a single box, a 4.5× increase in model size compared to using model parallelism alone.

    By combining compute and memory efficiency with ease-of-use, ZeRO-Offload democratizes large-scale model training making it accessible to even data scientists with access to just a single GPU.

    Figure 2: The dataflow of fully connected neural networks with M parameters. We use activation checkpoints to reduce activation memory to avoid activation migration between CPU and GPU.

  39. “Prefix-Tuning: Optimizing Continuous Prompts for Generation”, Xiang Lisa Li, Percy Liang (2021-01-01):

    Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task.

    In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix). Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”.

    We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We find that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.
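
    A minimal sketch of the idea (illustrative PyTorch: learnable prefix embeddings prepended to the frozen model’s input embeddings; the actual method optimizes prefix activations at every layer):

    ```python
    import torch
    import torch.nn as nn

    class PrefixTuned(nn.Module):
        """Keep the pretrained LM frozen; learn only a short continuous prefix of 'virtual tokens'."""
        def __init__(self, language_model, embedding, prefix_length=10, d_model=768):
            super().__init__()
            self.lm, self.embed = language_model, embedding
            for module in (self.lm, self.embed):
                for param in module.parameters():
                    param.requires_grad = False    # all pretrained parameters stay fixed
            self.prefix = nn.Parameter(torch.randn(prefix_length, d_model) * 0.02)

        def forward(self, input_ids):
            tokens = self.embed(input_ids)                                     # (B, T, D)
            prefix = self.prefix.unsqueeze(0).expand(tokens.size(0), -1, -1)   # (B, P, D)
            return self.lm(torch.cat([prefix, tokens], dim=1))                 # only the prefix trains
    ```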

  40. “It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners”, Timo Schick, Hinrich Schütze (2020-09-15):

    When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements. We identify key factors required for successful natural language understanding with small language models.

  41. “Scaling down Deep Learning”, Sam Greydanus (2020-12-01):

    …Yet in spite of its historical importance, MNIST has three notable shortcomings. First, it does a poor job of differentiating between linear, nonlinear, and translation-invariant models. For example, logistic, MLP, and CNN benchmarks obtain 94, 99+, and 99+% accuracy on it. This makes it hard to measure the contribution of a CNN’s spatial priors or to judge the relative effectiveness of different regularization schemes. Second, it is somewhat large for a toy dataset. Each input example is a 784-dimensional vector and thus it takes a non-trivial amount of computation to perform hyperparameter searches or debug a meta-learning loop. Third, MNIST is hard to hack. The ideal toy dataset should be procedurally generated so that researchers can smoothly vary parameters such as background noise, translation, and resolution.

    In order to address these shortcomings, we propose the MNIST-1D dataset. It is a minimalist, low-memory, and low-compute alternative to MNIST, designed for exploratory deep learning research where rapid iteration is a priority. Training examples are 20 times smaller but they are still better at measuring the difference between (1) linear and nonlinear classifiers and (2) models with and without spatial inductive biases (eg. translation invariance). The dataset is procedurally generated but still permits analogies to real-world digit classification…Unlike MNIST, each example is a one-dimensional sequence of points. To generate an example, we begin with a digit template and then randomly pad, translate, and transform it.
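
    A sketch of that generation loop (illustrative NumPy; the real generator applies a richer set of transformations):

    ```python
    import numpy as np

    def make_example(template, out_len=40, padded_len=48, noise_scale=0.05, rng=None):
        """Randomly pad/translate a 1D digit template, add noise, and resample to a fixed length."""
        rng = rng or np.random.default_rng()
        signal = np.zeros(padded_len)
        start = rng.integers(0, padded_len - len(template))       # random translation via padding
        signal[start:start + len(template)] = template
        signal += noise_scale * rng.standard_normal(padded_len)   # background noise
        # Resample the padded signal down to the fixed example length.
        return np.interp(np.linspace(0, padded_len - 1, out_len), np.arange(padded_len), signal)

    template = np.array([0, 1, 2, 3, 2, 1, 0], dtype=float)   # toy stand-in for a digit template
    x = make_example(template)                                # one 40-dimensional training example
    ```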

    Example use cases: In this section we will explore several examples of how MNIST-1D can be used to study core “science of deep learning” phenomena.

    1. Finding lottery tickets…Unlike many follow-up experiments on the lottery ticket, this one took just two days of researcher time to produce. The curious reader can also reproduce these results in their browser in a few minutes.

    2. Observing deep double descent…We see the MNIST-1D dataset as a good tool for exploring these properties. In fact, we were able to reproduce the pattern after a few hours of researcher effort. The figure below shows our results for a fully-connected network and a convolutional model.

    3. Gradient-based meta-learning…A model does this by having two levels of optimization: the first is a fast inner loop which corresponds to a traditional learning objective and second is a slow outer loop which updates the “meta” properties of the learning process…Meta-learning is a promising topic but it is very difficult to scale. First of all, meta-learning algorithms consume enormous amounts of time and compute. Second of all, implementations tend to grow complex since there are twice as many hyperparameters (one set for each level of optimization) and most deep learning frameworks are not set up well for meta-learning. This places an especially high incentive on debugging and iterating meta-learning algorithms on small-scale datasets such as MNIST-1D. For example, it took just a few hours to implement and debug the gradient-based hyperparameter optimization of a learning rate shown below.

      • Meta-learning an activation function: Having implemented a “minimal working example” of gradient-based meta-learning, we realized that it permitted a simple and novel extension: meta-learning an activation function. With a few more hours of researcher time, we were able to parameterize our classifier’s activation function with a second neural network and then learn the weights using meta-gradients.

    4. Measuring the spatial priors of deep networks: …Principal among these priors is the translation invariance of convolution. A primary motivation for this dataset was to construct a toy problem that could effectively quantify a model’s spatial priors. The second figure in this post illustrates that this is indeed possible with MNIST-1D.

    5. Benchmarking pooling methods. Our final case study begins with a specific question: What is the relationship between pooling and sample efficiency? We had not seen evidence that pooling makes models more or less sample efficient, but this seemed an important relationship to understand. With this in mind, we trained models with different pooling methods and training set sizes and found that, while pooling tended to be effective in low-data regimes, it did not make much of a difference in high-data regimes.
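    As a concrete illustration of the two-level setup described in example 3 above, here is a toy PyTorch sketch of gradient-based hyperparameter optimization of a learning rate (the synthetic data and all settings are invented; this is not the post's code):

    ```python
    # Toy gradient-based meta-learning: the inner loop takes a few differentiable SGD
    # steps on a small regression problem; the outer loop backpropagates the
    # validation loss through those steps into the (log) learning rate.
    import torch

    torch.manual_seed(0)
    X_tr, y_tr = torch.randn(64, 5), torch.randn(64, 1)
    X_va, y_va = torch.randn(64, 5), torch.randn(64, 1)

    log_lr = torch.tensor(-3.0, requires_grad=True)        # meta-parameter
    meta_opt = torch.optim.Adam([log_lr], lr=0.05)

    for outer_step in range(100):
        w = torch.zeros(5, 1, requires_grad=True)           # fresh inner model
        for _ in range(5):                                   # inner loop: plain SGD
            loss = ((X_tr @ w - y_tr) ** 2).mean()
            (grad,) = torch.autograd.grad(loss, w, create_graph=True)
            w = w - log_lr.exp() * grad                      # differentiable update
        val_loss = ((X_va @ w - y_va) ** 2).mean()           # outer objective
        meta_opt.zero_grad()
        val_loss.backward()                                  # meta-gradient w.r.t. log_lr
        meta_opt.step()
    ```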

    …this post argues in favor of small-scale machine learning research. Neural networks do not have problems with scaling or performance—but they do have problems with interpretability, reproducibility, and iteration speed. We see carefully-controlled, small-scale experiments as a great way to address these problems…For example, several of the findings reported in this post are at the point where they should be investigated at scale. We would like to show that large scale lottery tickets also learn spatial inductive biases, and show evidence that they develop local connectivity. We would also like to try meta-learning an activation function on a larger model in the hopes of finding an activation that will outperform ReLU and Swish in generality. We should emphasize that we are only ready to scale these results now that we have isolated and understood them in a controlled setting. We believe that scaling a system is only a good idea once the relevant causal mechanisms have been isolated and understood. [cf scaling law papers] …Our work also bears philosophical similarities to the ⁠.

    Closing Thoughts: There is a counterintuitive possibility that in order to explore the limits of how large we can scale neural networks, we may need to explore the limits of how small we can scale them first. Scaling models and datasets downward in a way that preserves the nuances of their behaviors at scale will allow researchers to iterate quickly on fundamental and creative ideas. This fast iteration cycle is the best way of obtaining insights about how to incorporate progressively more complex inductive biases into our models. We can then transfer these inductive biases across spatial scales in order to dramatically improve the sample efficiency and generalization properties of large-scale models. We see the humble MNIST-1D dataset as a first step in that direction.

  42. Faster

  43. ⁠, Adam P. Goucher (2021-01-08):

    The real cleverness of Stockfish’s neural network is that it’s an efficiently-updatable neural network (NNUE). Specifically, it’s a simple feedforward network with:

    • a large (10.5M parameters!) input layer, illustrated below, that can utilise two different levels of sparsity for computational efficiency;
    • three much smaller layers (with 17.5k parameters in total) which are evaluated densely using vector instructions;
    • a single scalar output to give a numerical score for the position, indicating how favourable it is for the player about to move.

    Everything is done using integer arithmetic, with 16-bit weights in the first layer and 8-bit weights in the remaining layers…The inputs to the layer are two sparse binary arrays, each consisting of 41,024 elements. It may seem highly redundant to encode a chess position using 82,048 binary features, but this is similar to an approach (called ‘feature crosses’) used in recommender systems.

    …There are two levels of sparsity which are utilised when computing this affine transformation from ℝ^41,024 to ℝ^256, allowing the network to be efficiently evaluated many times in a tree search:

    • the 41,024-element implicit vectors are themselves sparse: the number of nonzero elements is equal to the number of non-king pieces on the board.
    • moving a piece typically changes very few of the entries of the vector: if it’s a regular non-king move, only 2 entries change; if it’s a non-king move with capture, then 3 entries change.

    It’s this second aspect which warrants the name ‘efficiently updatable’: when a move is made (or unmade, since we’re doing a tree search), we only need to add/​​​​subtract a few 256-element matrix columns from the resulting ‘dense worldview’ to update it.

    Unless a king is moved, this (2 or 3 vector additions/​​​​subtractions) beats summing all of the matrix columns corresponding to nonzero entries (up to 30 vector additions), which in turn unconditionally beats doing a regular dense matrix-vector multiplication (41,024 vector additions). That is to say, the second-level sparsity is about 10× more efficient than the first-level sparsity, which is in turn about 1000× more efficient than naively doing a dense matrix-vector multiplication.
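    A NumPy sketch of the accumulator logic described above (the weights and feature indices are placeholders; a real engine keeps one such accumulator per perspective and uses hand-tuned integer SIMD code):

    ```python
    # The dense 256-element "worldview" is the sum of first-layer columns for the
    # active binary features, so a move only needs a few column additions and
    # subtractions instead of a full 41,024-dimensional matrix-vector product.
    import numpy as np

    N_FEATURES, HIDDEN = 41_024, 256
    W1 = np.random.randint(-64, 64, size=(N_FEATURES, HIDDEN), dtype=np.int16)  # placeholder weights

    def full_refresh(active_features):
        """Level-1 sparsity: sum one column per active (non-king) piece feature."""
        acc = np.zeros(HIDDEN, dtype=np.int32)
        for f in active_features:
            acc += W1[f]
        return acc

    def apply_move(acc, removed_features, added_features):
        """Level-2 sparsity: incremental update when a move is made (or unmade)."""
        for f in removed_features:
            acc -= W1[f]
        for f in added_features:
            acc += W1[f]
        return acc
    ```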

  44. ⁠, Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah (2021-01-04):

    Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences as compared to recurrent networks (e.g., long short-term memory (LSTM) networks). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of Transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works.

  45. ⁠, OpenAI (2020-12-29):

    Today we’re announcing that Dario Amodei, VP of Research, is leaving OpenAI after nearly five years with the company. Dario has made tremendous contributions to our research in that time, collaborating with the team to build GPT-2 and GPT-3, and working with Ilya Sutskever as co-leader in setting the direction for our research.

    Dario has always shared our goal of responsible AI. He and a handful of OpenAI colleagues are planning a new project, which they tell us will probably focus less on product development and more on research. We support their move and we’re grateful for the time we’ve spent working together.

    “We are incredibly thankful to Dario for his contributions over the past four and a half years. We wish him and his co-founders all the best in their new project, and we look forward to a collaborative relationship with them for years to come”, said OpenAI chief executive Sam Altman.

    When his departure was announced at an employee meeting earlier this month, Dario told coworkers, “I want to thank Sam and thank everyone. I’m really proud of the work we’ve done together. I want to wish everyone the best, and I know that OpenAI will do really great things in the years ahead. We share the same goal of safe artificial general intelligence to benefit humanity, so it’s incumbent on all of us in this space to work together to make sure things go well.”

    OpenAI is also making a few organizational changes to put greater focus on the integration of research, product, and safety. Mira Murati is taking on new responsibilities as senior vice president of Research, Product, and Partnerships, reflecting her strong leadership during our API rollout and across the company.

    Sam added, “OpenAI’s mission is to thoughtfully and responsibly develop general-purpose artificial intelligence, and as we enter the new year our focus on research—especially in the area of safety—has never been stronger. Making AI safer is a company-wide priority, and a key part of Mira’s new role.”

  46. ⁠, Steve Blank (2009-12-21):

    Sometimes financial decisions that are seemingly rational on their face can precipitate mass exodus of your best engineers…“Do you know how much our company is spending on free sodas and snacks?” And to answer her own question she presented the spreadsheet totaling it all up. There were some experienced VC’s in the room and I was waiting for them to “educate” her about startup culture. But my jaw dropped when the board agreed that the “free stuff” had to go… I had lived through this same conversation four times in my career, and each time it ended as an example of unintended consequences. No one on the board or the executive staff was trying to be stupid. But to save $23,376 ($10,000 in 1990 dollars) or so, they unintentionally launched an exodus of their best engineers.

    The Elves Leave Middle Earth—Sodas Are No Longer Free: One day the engineering team was clustered in the snack room looking at the soda machine. The sign said, “Soda now 50 cents.” The uproar began. Engineers started complaining about the price of the soda. Someone noticed that instead of the informal reimbursement system for dinners when they were working late, there was now a formal expense report system. Some had already been irritated when “professional” managers had been hired over their teams with reportedly more stock than the early engineers had. Lots of email was exchanged about “how things were changing for the worse.” A few engineers went to see the CEO.

    But the damage had been done. The most talented and senior engineers looked up from their desks and noticed the company was no longer the one they loved. It had changed. And not in a way they were happy with.

    The best engineers quietly put the word out that they were available, and in less than a month the best and the brightest began to drift away…The engineers focused on building product never noticed when the company had grown into something different than what they first joined.

    The sodas were just the wake-up call.

  47. https://www.lesswrong.com/posts/pTYDdcag9pTzFQ7vw/2020-ai-alignment-literature-review-and-charity-comparison

  48. ⁠, Felix Hill, Olivier Tieleman, Tamara von Glehn, Nathaniel Wong, Hamza Merzic, Stephen Clark (2020-09-03):

    [Previously: ⁠, Merel et al 2020; see also ⁠.] Recent work has shown that large text-based neural language models, trained with conventional supervised learning objectives, acquire a surprising propensity for few-shot and one-shot learning.

    Here, we show that an embodied agent situated in a simulated 3D world, and endowed with a novel dual-coding external memory, can exhibit similar one-shot word learning when trained with conventional reinforcement learning algorithms. After a single introduction to a novel object via continuous visual perception and a language prompt (“This is a dax”), the agent can re-identify the object and manipulate it as instructed (“Put the dax on the bed”). In doing so, it seamlessly integrates short-term, within-episode knowledge of the appropriate referent for the word “dax” with long-term lexical and motor knowledge acquired across episodes (ie. “bed” and “putting”).

    We find that, under certain training conditions and with a particular memory writing mechanism, the agent’s one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category, and is effective in settings with unfamiliar numbers of objects. We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may be useful for later executing instructions.

    Together, the results demonstrate that deep neural networks can exploit meta-learning, episodic memory and an explicitly multi-modal environment to account for ‘fast-mapping’, a fundamental pillar of human cognitive development and a potentially transformative capacity for agents that interact with human users.

  49. ⁠, Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen (2020-06-05):

    Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models’ generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).
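    A simplified NumPy sketch of the disentangled attention score for a single head (the relative-distance bucketing is simplified, and the enhanced mask decoder is omitted; this is a reading of the mechanism, not the authors' code):

    ```python
    # Each token has a content vector; positions enter only through relative-position
    # embeddings, so the raw attention score is the sum of content-to-content,
    # content-to-position, and position-to-content terms, scaled by 1/sqrt(3d).
    import numpy as np

    L, d, max_rel = 8, 16, 4                      # sequence length, head dim, max relative distance
    rng = np.random.default_rng(0)
    H = rng.normal(size=(L, d))                   # content states
    P = rng.normal(size=(2 * max_rel, d))         # relative position embeddings
    Wq, Wk, Wqr, Wkr = (rng.normal(size=(d, d)) for _ in range(4))

    Qc, Kc = H @ Wq, H @ Wk                       # content projections
    Qr, Kr = P @ Wqr, P @ Wkr                     # relative-position projections

    def delta(i, j):                              # simplified bucketed distance δ(i, j)
        return int(np.clip(i - j, -max_rel, max_rel - 1)) + max_rel

    A = np.zeros((L, L))
    for i in range(L):
        for j in range(L):
            A[i, j] = (Qc[i] @ Kc[j]              # content-to-content
                       + Qc[i] @ Kr[delta(i, j)]  # content-to-position
                       + Kc[j] @ Qr[delta(j, i)]) # position-to-content
    attn = np.exp(A / np.sqrt(3 * d))
    attn /= attn.sum(axis=1, keepdims=True)       # row-wise softmax
    ```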

  50. ⁠, Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman (2019-05-02):

    In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research.

    In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard⁠. SuperGLUE is available at super.gluebenchmark.com⁠.

  51. https://super.gluebenchmark.com/leaderboard/

  52. ⁠, Vinod Nair, Sergey Bartunov, Felix Gimeno, Ingrid von Glehn, Pawel Lichocki, Ivan Lobov, Brendan O'Donoghue, Nicolas Sonnerat, Christian Tjandraatmadja, Pengming Wang, Ravichandra Addanki, Tharindi Hapuarachchi, Thomas Keck, James Keeling, Pushmeet Kohli, Ira Ktena, Yujia Li, Oriol Vinyals, Yori Zwols (2020-12-23):

    [Followup: ] Mixed Integer Programming (MIP) solvers rely on an array of sophisticated heuristics developed with decades of research to solve large-scale MIP instances encountered in practice. Machine learning offers to automatically construct better heuristics from data by exploiting shared structure among instances in the data.

    This paper applies learning to the two key sub-tasks of a MIP solver, generating a high-quality joint variable assignment, and bounding the gap in objective value between that assignment and an optimal one. Our approach constructs two corresponding neural network-based components, Neural Diving and Neural Branching, to use in a base MIP solver such as SCIP. Neural Diving learns a deep neural network to generate multiple partial assignments for its integer variables, and the resulting smaller MIPs for un-assigned variables are solved with SCIP to construct high quality joint assignments. Neural Branching learns a deep neural network to make variable selection decisions in branch-and-bound to bound the objective value gap with a small tree. This is done by imitating a new variant of Full Strong Branching we propose that scales to large instances using GPUs.

    We evaluate our approach on six diverse real-world datasets, including two Google production datasets and MIPLIB, by training separate neural networks on each. Most instances in all the datasets combined have 10³–10⁶ variables and constraints after presolve, which is substantially larger than previous learning approaches. Comparing solvers with respect to primal-dual gap averaged over a held-out set of instances, the learning-augmented SCIP is 2× to 10× better on all datasets except one on which it is 10⁵× better, at large time limits. To the best of our knowledge, ours is the first learning approach to demonstrate such large improvements over SCIP on both large-scale real-world application datasets and MIPLIB.
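    Schematically, the Neural Diving component works roughly as in the following Python sketch (every object here — `mip`, `model`, `solve_mip` — is a hypothetical placeholder with duck-typed methods, not the paper's or SCIP's actual API):

    ```python
    # Neural Diving sketch: a generative model proposes values for a high-confidence
    # subset of the integer variables, those variables are fixed, and the much
    # smaller residual MIP is handed to an off-the-shelf solver (SCIP in the paper).
    def neural_dive(mip, model, solve_mip, coverage=0.8, n_samples=16):
        """Return the best feasible assignment found across sampled partial assignments."""
        best = None
        for _ in range(n_samples):
            # The model predicts a distribution per integer variable; sample values
            # for the fraction of variables it is most confident about.
            partial = model.sample_partial_assignment(mip, coverage)   # hypothetical
            residual = mip.fix_variables(partial)                      # smaller MIP over the rest
            solution = solve_mip(residual, time_limit=60)              # e.g., a SCIP wrapper
            if solution is not None:
                full = solution.extend_with(partial)                   # re-attach fixed values
                if best is None or full.objective < best.objective:
                    best = full
        return best
    ```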

  53. ⁠, Nicolas Sonnerat, Pengming Wang, Ira Ktena, Sergey Bartunov, Vinod Nair (2021-07-21):

    Large Neighborhood Search (LNS) is a combinatorial optimization heuristic that starts with an assignment of values for the variables to be optimized, and iteratively improves it by searching a large neighborhood around the current assignment. In this paper we consider a learning-based LNS approach for mixed integer programs (MIPs). We train a Neural Diving model to represent a probability distribution over assignments, which, together with an off-the-shelf MIP solver, generates an initial assignment. Formulating the subsequent search steps as a Markov Decision Process, we train a Neural Neighborhood Selection policy to select a search neighborhood at each step, which is searched using a MIP solver to find the next assignment. The policy network is trained using imitation learning. We propose a target policy for imitation that, given enough compute resources, is guaranteed to select the neighborhood containing the optimal next assignment amongst all possible choices for the neighborhood of a specified size. Our approach matches or outperforms all the baselines on five real-world MIP datasets with large-scale instances from diverse applications, including two production applications at Google. It achieves 2× to 37.8× better average primal gap than the best baseline on three of the datasets at large running times.
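    The corresponding search loop can be sketched as follows (again with hypothetical placeholder objects rather than any real solver API):

    ```python
    # Schematic Neural Neighborhood Selection loop (learning-based LNS): a learned
    # policy picks which variables may change at each step, and a MIP solver
    # searches that neighborhood for an improved assignment.
    def neural_lns(mip, initial, policy, solve_mip, steps=20, neighborhood_size=50):
        current = initial
        for _ in range(steps):
            free_vars = policy.select_variables(mip, current, k=neighborhood_size)  # hypothetical
            sub_mip = mip.fix_all_except(current, free_vars)                        # fix everything else
            improved = solve_mip(sub_mip, time_limit=30)
            if improved is not None and improved.objective < current.objective:
                current = improved                                                  # accept the better assignment
        return current
    ```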

  54. ⁠, Ryota Hinami, Shonosuke Ishiwatari, Kazuhiko Yasuda, Yusuke Matsui (2020-12-28):

    We tackle the problem of machine translation of manga, Japanese comics. Manga translation involves two important problems in machine translation: context-aware and multimodal translation. Since text and images are mixed up in an unstructured fashion in manga, obtaining context from the image is essential for manga translation. However, it is still an open problem how to extract context from the image and integrate it into MT models. In addition, corpora and benchmarks to train and evaluate such models are currently unavailable. In this paper, we make the following four contributions that establish the foundation of manga translation research. First, we propose a multimodal context-aware translation framework. We are the first to incorporate context information obtained from the manga image. It enables us to translate texts in speech bubbles that cannot be translated without using context information (e.g., texts in other speech bubbles, gender of speakers, etc.). Second, for training the model, we propose an approach to automatic corpus construction from pairs of original manga and their translations, by which a large parallel corpus can be constructed without any manual labeling. Third, we created a new benchmark to evaluate manga translation. Finally, on top of our proposed methods, we devised the first comprehensive system for fully automated manga translation.

  55. ⁠, Siyi Hu, Fengda Zhu, Xiaojun Chang, Xiaodan Liang (2021-01-20):

    Recent advances in multi-agent reinforcement learning have been largely limited in training one model from scratch for every new task. The limitation is due to the restricted model architecture related to fixed input and output dimensions. This hinders the experience accumulation and transfer of the learned agent over tasks with diverse levels of difficulty (e.g. 3 vs 3 or 5 vs 6 multi-agent games). In this paper, we make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing one single architecture to fit tasks with the requirement of different observation and action configurations. Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy by decoupling the policy distribution from the intertwined input observation with an importance weight measured by the merits of the self-attention mechanism. Compared to a standard transformer block, the proposed model, named as Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task’s decision process more explainable. UPDeT is general enough to be plugged into any multi-agent reinforcement learning pipeline and equip them with strong generalization abilities that enable the handling of multiple tasks at a time. Extensive experiments on large-scale SMAC multi-agent competitive games demonstrate that the proposed UPDeT-based multi-agent reinforcement learning achieves significant results relative to state-of-the-art approaches, demonstrating advantageous transfer capability in terms of both performance and training speed (10 times faster).

  56. ⁠, Albert Zhan, Philip Zhao, Lerrel Pinto, Pieter Abbeel, Michael Laskin (2020-12-14):

    Data-efficient learning of manipulation policies from visual observations is an outstanding challenge for real-robot learning. While deep reinforcement learning (RL) algorithms have shown success learning policies from visual observations, they still require an impractical number of real-world data samples to learn effective policies. However, recent advances in unsupervised representation learning and data augmentation significantly improved the sample efficiency of training RL policies on common simulated benchmarks. Building on these advances, we present a Framework for Efficient Robotic Manipulation (FERM) that utilizes data augmentation and unsupervised learning to achieve extremely sample-efficient training of robotic manipulation policies with sparse rewards. We show that, given only 10 demonstrations, a single robotic arm can learn sparse-reward manipulation policies from pixels, such as reaching, picking, moving, pulling a large object, flipping a switch, and opening a drawer in just 15–50 minutes of real-world training time. We include videos, code, and additional information on the project website—https:/​​​​/​​​​sites.google.com/​​​​view/​​​​efficient-robotic-manipulation.

  57. ⁠, Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang (2021-01-12):

    [blog] The output of text-to-image synthesis systems should be coherent, clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. Our Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses this challenge by maximizing the mutual information between image and text.

    It does this via multiple contrastive losses which capture inter-modality and intra-modality correspondences. XMC-GAN uses an attentional self-modulation generator, which enforces strong text-image correspondence, and a contrastive discriminator, which acts as a critic as well as a feature encoder for contrastive learning.

    The quality of XMC-GAN’s output is a major step up from previous models, as we show on three challenging datasets. On MS-COCO, not only does XMC-GAN improve state-of-the-art FID from 24.70 to 9.33, but—more importantly—people prefer XMC-GAN by 77.3% for image quality and 74.1% for image-text alignment, compared to three other recent models. XMC-GAN also generalizes to the challenging Localized Narratives dataset (which has longer, more detailed descriptions), improving state-of-the-art FID from 48.70 to 14.12. Lastly, we train and evaluate XMC-GAN on the challenging Open Images data, establishing a strong benchmark FID score of 26.91.
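    A minimal PyTorch sketch of the inter-modal (whole image ↔ sentence) contrastive term (XMC-GAN also uses region↔word and real↔fake-image contrastive losses and an attentional self-modulation generator, all omitted here; the temperature value is an assumption):

    ```python
    # InfoNCE-style inter-modal loss: matching image/caption embedding pairs are
    # pulled together and mismatched pairs within the batch are pushed apart.
    import torch
    import torch.nn.functional as F

    def intermodal_contrastive_loss(img_emb, txt_emb, temperature=0.1):
        """img_emb, txt_emb: (batch, dim) embeddings of paired images and captions."""
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / temperature          # scaled cosine similarities
        targets = torch.arange(img.size(0), device=img.device)
        # Symmetric cross-entropy: each image should match its own caption and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
    ```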

  58. ⁠, Sara A. Hart, Callie Little, Elsje van Bergen (2021-01-08):

    Across a wide range of studies, researchers often conclude that the home environment and children’s outcomes are causally linked. In contrast, behavioral genetic studies show that parents influence their children by providing them with both environment and genes, meaning the environment that parents provide should not be considered in the absence of genetic influences, because that can lead to erroneous conclusions on causation. This article seeks to provide behavioral scientists with a synopsis of numerous methods to estimate the direct effect of the environment, controlling for the potential of genetic confounding. Ideally, using genetically sensitive designs can fully disentangle this genetic confound, but these require specialized samples. In the near future, researchers will likely have access to measured DNA variants (summarized in polygenic scores), which could serve as a partial genetic control, but that is currently not an option that is ideal or widely available. We also propose a work-around for when genetically sensitive data are not readily available: the Familial Control Method. In this method, one measures the same trait in the parents as the child, and the parents’ trait is then used as a covariate (e.g., a genetic proxy). When these options are all not possible, we plead with our colleagues to clearly mention genetic confounding as a limitation, and to be cautious with any environmental causal statements which could lead to unnecessary parent blaming.
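    A minimal sketch of what the proposed Familial Control Method looks like in practice, assuming a hypothetical dataset with the child's outcome, the measured home environment, and the same trait measured in a parent (illustrative only, not the authors' analysis code):

    ```python
    # Regress the child's outcome on the home-environment measure with and without
    # the parent's trait as a covariate; the parent's trait serves as a rough proxy
    # for shared genetic influence.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("family_data.csv")   # hypothetical columns: child_reading,
                                          # n_books_home, parent_reading

    naive = smf.ols("child_reading ~ n_books_home", data=df).fit()
    familial = smf.ols("child_reading ~ n_books_home + parent_reading", data=df).fit()

    # If the n_books_home coefficient shrinks substantially once parent_reading is
    # included, part of the naive association was plausibly genetic confounding.
    print(naive.params["n_books_home"], familial.params["n_books_home"])
    ```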

    Most parents spend hours fretting over decisions about the environment they provide to their children. The scientific literature mirrors this idea. Across a wide range of studies from many psychological domains, researchers often conclude that the environment parents provide and children’s outcomes are causally linked, through environmental transmission (see Box 1). For example, a study examining the association of having a home library as an adolescent and later adult literacy, numeracy and technology skills drew our attention because of in-depth coverage in the Guardian⁠. This study used a very rich and dataset, and found a correlation between the number of books in adolescents’ homes and literacy performance in adulthood. They conclude that “growing up with home libraries boosts adult skills”, inferring a causal connection. This is depicted in Figure 1. Here we discuss how the correlation between the environments parents provide, the “rearing environment”, and their children’s outcomes can indeed be fully due to a causal association, or importantly, can also be partly or fully due to a genetic confounding, illustrated in Figure 2 (see Footnote 1 in the Supplementary Notes). After highlighting the problem, we suggest ways that psychological scientists can examine research questions related to the rearing environment and children’s outcomes in ways that account for, or at least acknowledge, genetic confounding.

    An example of how genetic confounding works (note, only one parent drawn, for simplicity). Parents share genes related to reading ability with their children, and also control the number of books in their home. This creates gene-environment interplay. It is important to note that the environmental effect may still have a causal role, even with gene-environment interplay. If genes play a role but are not modeled (as in Figure 1), the correlation between the environmental measure and the child’s trait is genetically confounded. Here, the role of genes is modeled, allowing for an estimation of the genetic effect and the environmental effect.
  59. ⁠, Eco J. C. de Geus (2020-12-14):

    • Triangulation across the results from genetically informative designs supports the existence of causal effects of exercise on mental health as well as residual confounding by genetic factors that independently influence participation in regular exercise and mental health outcomes.
    • A model explaining the heritability of voluntary exercise behaviour in terms of genetic moderation of its positive mental health effects can explain how causal effects co-exist with genetic pleiotropy.
    • The model calls for further research with strategies that use genomic information to improve the success of interventions on regular exercise behaviour.

    Regular exercise is associated with mental health throughout the life course but the chain-of-causality underlying this association remains contested. I review results from genetically informative designs that examine causality, including the discordant monozygotic twin design, multivariate genetic models, Mendelian randomization, and stratification on polygenic risk scores. Triangulation across the results from these and the standard designs for causal inference (RCTs, prospective studies) in the extant literature supports the existence of causal effects of exercise on mental health as well as residual confounding by genetic factors that independently influence participation in regular exercise and mental health outcomes. I present an update of our earlier model for the genetic determinants of voluntary exercise behaviour. The model allows causal effects of regular exercise on mental health to co-exist with genetic pleiotropy through differences in the genetic sensitivity to the mental health benefits of exercise. The model encourages research on strategies that use genomic information to improve the success of interventions on regular exercise behaviour.

    [Keywords: twin study, Mendelian randomization, polygenic risk score, exercise psychology, personalized medicine]

    Figure 3: Genetic correlation between exercise behaviour and mental health. Note: The higher order latent genetic factor in the oval on the left contains all sets of genetic variants that explain the heritability of regular voluntary exercise behaviour. The sets of variants that are relevant for the model (G1 through G8) are repeated in the figure close to the traits where they apply. By influencing the causal mechanisms through which exercise influences mental health, these genetic variants create a genetic correlation between exercise and mental health. This genetic pleiotropy is indicated by the large dashed arrows.
  60. 2020-schnurr.pdf: ⁠, Theresia M. Schnurr, Bente M. Stallknecht, Thorkild I.A. Sørensen, Tuomas O. Kilpeläinen, Torben Hansen (2020-12-22; genetics  /​ ​​ ​correlation):

    Observational, cross-sectional and longitudinal studies showed that physical activity and sedentary behaviour are associated with adiposity-related traits, apparently in a bidirectional manner. Physical activity is also suggested to suppress the genetic risk of adiposity.

    Since phenotypic associations with genetic variants are not subject to reverse causation or confounding, they may be used as tools to shed light on cause and effect in this complex interdependency. We review the evidence for shared genetics of physical activity and adiposity-related traits and for gene-by-physical activity interactions on adiposity-related traits in human studies. We outline limitations, challenges and opportunities in studying and understanding of these relationships.

    In summary, physical activity and sedentary behaviour are genetically correlated with BMI and fat percentage but may not be correlated with lean body mass. Mendelian randomisation analyses show that physical activity and sedentary behaviour have bidirectional relationships with adiposity. Several studies suggest that physical activity suppresses genetic risk of adiposity. No studies have yet tested whether adiposity enhances genetic predisposition to sedentariness.

    The complexity of the comprehensive causal model makes the assessment of the single or combined components challenging. Substantial progress in this field may need long-term intervention studies.

    [Keywords: adiposity, genetic determinants, physical activity, sedentary behaviour]

  61. ⁠, Oliver Pain, Karen Hodgson, Vassily Trubetskoy, Stephan Ripke, Victoria S. Marshe, Mark J. Adams, Enda M. Byrne, Adrian I. Campos, Tania Carrillo-Roa, Annamaria Cattaneo, Thomas Damm Als, Daniel Souery, Mojca Z. Dernovsek, Chiara Fabbri, Caroline Hayward, Neven Henigsberg, Joanna Hauser, James L. Kennedy, Eric J. Lenze, Glyn Lewis, Daniel J. Müller, Nicholas G. Martin, Benoit H. Mulsant, Ole Mors, Nader Perroud, David J. Porteous, Miguel E. Rentería, Charles F. Reynolds, Marcella Rietschel, Rudolf Uher, Eleanor M. Wigmore, Wolfgang Maier, Naomi R. Wray, Katherine J. Aitchison, Volker Arolt, Bernhard T. Baune, Joanna M. Biernacka, Guido Bondolfi, Katharina Domschke, Masaki Kato, Qingqin S. Li, Yu-Li Liu, Alessandro Serretti, Shih-Jen Tsai, Gustavo Turecki, Richard Weinshilboum, the GSRD Consortium, the Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium, Andrew M. McIntosh, Cathryn M. Lewis (2020-12-15):

    Importance: Antidepressants are a first line treatment for depression. However, only a third of individuals remit after the first treatment. Genetic variation likely regulates antidepressant response, yet the success of previous genome-wide association studies has been limited by sample size.

    Objective: Gain insight into underlying biology of antidepressant response, characterize SNP-based heritability and genetic overlap with related outcomes, and evaluate out-of-sample prediction using polygenic scores.

    Design: Genome-wide association study (GWAS) of antidepressant response measures, Remission and Percentage Improvement in depression scores.

    Setting: Multiple international recruitment sites, including clinical trial and open label studies.

    Participants: Diagnosed with Major Depressive Disorder and assessed for depressive symptoms before and after prescription of an antidepressant medication.

    Main Outcome(s) and Measure(s): Antidepressant response measured as Remission and Percentage Improvement.

    Results: Genome-wide analysis of Remission (n_remit = 1,852, n_non-remit = 3,299) and Percentage Improvement (n = 5,218) identified no genome-wide statistically-significant variants. The heritability from common variants was statistically-significantly different from zero for Remission (h² = 0.132, SE = 0.056), but not Percentage Improvement (h² = −0.018, SE = 0.032). Polygenic score analysis showed better antidepressant response was associated with lower genetic risk for schizophrenia, and higher genetic propensity for educational attainment. Polygenic scores for antidepressant response demonstrated weak but statistically-significant evidence of out-of-sample prediction across cohorts, though results varied in external cohorts.

    Conclusions and Relevance: This study demonstrates antidepressant response is influenced by common genetic variation, has a genetic overlap with schizophrenia and educational attainment, and provides a useful resource for future research. Larger sample sizes are required to attain the potential of genetics for understanding and predicting antidepressant response.

    Question: What is the genetic architecture of antidepressant response, and how is it associated with other traits?

    Findings: This GWAS of antidepressant response finds that SNP-based heritability was statistically-significantly different from zero for Remission (h² = 0.132, SE = 0.056), but not for Percentage Improvement (h² = −0.018, SE = 0.032). Polygenic score analysis showed better antidepressant response was associated with lower genetic risk for schizophrenia, and higher genetic propensity for educational attainment.

    Meaning: This study demonstrates antidepressant response is influenced by common genetic variation, has a genetic overlap with schizophrenia and educational attainment, and provides a useful resource for future research.

  62. ⁠, Guillaume Huguet, Catherine Schramm, Elise Douard, Tamer Petra, Antoine Main, Pauline Monin, Jade England, Khadije Jizi, Thomas Renne, Myriam Poirier, Sabrina Nowak, Charles-Olivier Martin, Nadine Younis, Inga Sophia Knoth, Martineau Jean-Louis, Zohra Saci, Maude Auger, Frédérique Tihy, Géraldine Mathonnet, Catalina Maftei, France Léveillé, David Porteous, Gail Davies, Paul Redmond, Sarah E. Harris, W. David Hill, Emmanuelle Lemyre, Gunter Schumann, Thomas Bourgeron, Zdenka Pausova, Tomas Paus, Sherif Karama, Sarah Lippe, Ian J. Deary, Laura Almasy, Aurélie Labbe, David Glahn, Celia M. T. Greenwood, Sébastien Jacquemont (2020-10-05):

    Genomic Copy Number Variants (CNVs) are routinely identified and reported back to patients with neuropsychiatric disorders, but their quantitative effects on essential traits such as cognitive ability are poorly documented. We have recently shown that the effect-size of deletions on cognitive ability can be statistically predicted using measures of intolerance to haploinsufficiency. However, the effect-sizes of duplications remain unknown. It is also unknown if the effects of multigenic CNVs are driven by a few genes intolerant to haploinsufficiency or distributed across tolerant genes as well.

    Here, we identified all CNVs >50 kilobases in 24,092 individuals from unselected and autism cohorts with assessments of general intelligence. Statistical models used measures of intolerance to haploinsufficiency of genes included in CNVs to predict their effect-size on intelligence. Intolerant genes decrease general intelligence by 0.8 and 2.6 points of IQ when duplicated or deleted, respectively. Effect-sizes showed no heterogeneity across cohorts. Validation analyses demonstrated that models could predict CNV effect-sizes with 78% accuracy. Data on the inheritance of 27,766 CNVs showed that deletions and duplications with the same effect-size on intelligence occur de novo at the same frequency.

    We estimated that around 10,000 intolerant and tolerant genes negatively affect intelligence when deleted, and less than 2% have large effect-sizes. Genes encompassed in CNVs were not enriched in any GO terms but gene regulation and brain expression were GO terms overrepresented in the intolerant subgroup. Such pervasive effects on cognition may be related to emergent properties of the genome not restricted to a limited number of biological pathways.

  63. ⁠, Nasa Sinnott-Armstrong, Sahin Naqvi, Manuel Rivas, Jonathan K. Pritchard (2021-01-12):

    Genome-wide association studies (GWAS) have been used to study the genetic basis of a wide variety of complex diseases and other traits. We describe UK Biobank GWAS results for three molecular traits—urate, IGF-1, and testosterone—with better-understood biology than most other complex traits. We find that many of the most significant hits are readily and surprisingly interpretable. We observe huge enrichment of associations near genes involved in the relevant biosynthesis, transport, or signaling pathways. We show how GWAS data illuminate the biology of each trait, including differences in testosterone regulation between females and males. At the same time, even these molecular traits are highly polygenic, with many thousands of variants spread across the genome contributing to trait variance. In summary, for these three molecular traits we identify strong enrichment of signal in putative core gene sets, even while most of the SNP-based heritability is driven by a massively polygenic background.

  64. ⁠, Pere Gelabert, Susanna Sawyer, Anders Bergström, Thomas C. Collin, Tengiz Meshveliani, Anna Belfer-Cohen, David Lordkipanidze, Nino Jakeli, Zinovi Matskevich, Guy Bar-Oz, Daniel M. Fernandes, Olivia Cheronet, Kadir T. Özdoğan, Victoria Oberreiter, Robin N. M. Feeney, Mareike C. Stahlschmidt, Pontus Skoglund, Ron Pinhasi (2021-01-08):

    Archaeological sediments have been shown to preserve ancient DNA, but so far have not yielded genome-scale information of the magnitude of skeletal remains. We retrieved and analysed human and mammalian low-coverage nuclear and high-coverage mitochondrial genomes from Upper Palaeolithic sediments from Satsurblia cave, western Georgia, dated to 25,000 years ago. First, a human female genome with substantial basal Eurasian ancestry, which was an ancestry component of the majority of post-Ice Age people in the Near East, North Africa, and parts of Europe. Second, a wolf genome that is basal to extant Eurasian wolves and dogs and represents a previously unknown, likely extinct, Caucasian lineage that diverged from the ancestors of modern wolves and dogs before these diversified. Third, a bison genome that is basal to present-day populations, suggesting that population structure has been substantially reshaped since the Last Glacial Maximum. Our results provide new insights into the late Pleistocene genetic histories of these three species, and demonstrate that sediment DNA can be used not only for species identification, but also be a source of genome-wide ancestry information and genetic history.

    Highlights:

    • We demonstrate for the first time that genome sequencing from sediments is comparable to that of skeletal remains
    • A single Pleistocene sediment sample from the Caucasus yielded three low-coverage mammalian ancient genomes
    • We show that sediment ancient DNA can reveal important aspects of the human and faunal past
    • Evidence of an uncharacterized human lineage from the Caucasus before the Last Glacial Maximum
    • ~0.01× coverage wolf and bison genomes are both basal to present-day diversity, suggesting reshaping of population structure in both species

  65. ⁠, Carolina Carvalho, Frank Wendt, Gita A. Pathak, Adam Maihofer, Dan Stein, Jennifer Sumner, Sian Hemmings, Caroline Nievergelt, Karestan Koenen, Joel Gelernter, Sintia Belangero, Renato Polimanti (2021-01-26):

    There is a well-known association of post-traumatic stress disorder (PTSD) and traumatic experiences with body size and composition, including consistent differences between sexes. However, the biology underlying these associations is unclear.

    To understand this complex relationship, we investigated large-scale datasets from the Psychiatric Genomics Consortium (12,823 cases and 35,648 controls), the UK Biobank (up to 360,000 individuals), and the GIANT (Genetic Investigation of Anthropometric Traits) Consortium (up to 339,224 individuals). We used genome-wide association statistics to estimate sex-specific genetic correlations (rg) among PTSD, traumatic experiences, social support, and multiple anthropometric traits.

    After multiple testing corrections (false discovery rate, FDR q<0.05), we observed 58 statistically-significant rg relationships in females (eg., childhood physical abuse and body mass index, BMI rg = 0.245, p = 3.88×10⁻¹⁰) and 21 statistically-significant rg relationships in males (eg., been involved in combat or exposed to warzone and leg fat percentage; rg = 0.405, p = 4.42×10⁻¹⁰). We performed causal inference analyses of these genetic overlaps using Mendelian randomization and latent causal variable approaches. Multiple female-specific putative causal relationships were observed linking body composition/size with PTSD (eg., leg fat percentage→PTSD; β = 0.319, p = 3.13×10⁻⁹), traumatic experiences (eg., childhood physical abuse→waist circumference; β = 0.055, p = 5.07×10⁻⁴), and childhood neglect (eg., ‘someone to take you to doctor when needed as a child’→BMI; β = −0.594, p = 1.09×10⁻⁵). In males, we observed putative causal effects linking anthropometric-trait genetic liabilities to traumatic experiences (eg., BMI→childhood physical abuse; β = 0.028, p = 8.19×10⁻³).

    In conclusion, our findings provide insights regarding sex-specific causal networks linking anthropometric traits to PTSD, traumatic experiences, and social support.

  66. ⁠, Luke J. O'Connor, Alkes L. Price (2018-10-29):

    Mendelian randomization (MR), a method to infer causal relationships, is confounded by genetic correlations reflecting shared etiology.

    We developed a model in which a latent causal variable (LCV) mediates the genetic correlation; trait 1 is partially genetically causal for trait 2 if it is strongly genetically correlated with the LCV, quantified using the genetic causality proportion (gcp).

    We fit this model using mixed fourth moments E(α₁²·α₁α₂) and E(α₂²·α₁α₂) of marginal effect sizes for each trait; if trait 1 is causal for trait 2, then SNPs affecting trait 1 (large α₁²) will have correlated effects on trait 2 (large α₁α₂), but not vice versa. In simulations, our method avoided false positives due to genetic correlations, unlike MR.

    Across 52 traits (average n = 331k), we identified 30 causal relationships with high gcp estimates. Novel findings included a causal effect of LDL on bone mineral density, consistent with clinical trials of statins in osteoporosis.
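    A toy NumPy simulation of the intuition behind these mixed fourth moments (a deliberately simplified illustration under standardized, spike-and-slab effect sizes, not the full LCV estimator):

    ```python
    # When trait 1 causally affects trait 2, large-effect SNPs for trait 1 drag
    # trait 2 along, so E(a1^2 * a1*a2) exceeds E(a2^2 * a1*a2); the inequality
    # would flip if trait 2 were instead causal for trait 1.
    import numpy as np

    rng = np.random.default_rng(0)
    m, q = 200_000, 0.4                       # number of SNPs; causal effect of trait 1 on trait 2

    # Heavy-tailed (spike-and-slab) effects for trait 1, standardized to unit variance.
    causal = rng.random(m) < 0.01
    a1 = np.where(causal, rng.normal(0, 10, m), 0.0)
    a1 /= a1.std()

    a2 = q * a1 + np.sqrt(1 - q**2) * rng.normal(0, 1, m)   # trait 1 -> trait 2

    m1 = np.mean(a1**2 * a1 * a2)             # "trait-1-weighted" mixed moment
    m2 = np.mean(a2**2 * a1 * a2)             # "trait-2-weighted" mixed moment
    print(m1, m2)                             # m1 > m2 suggests partial causality of trait 1
    ```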

  67. 2021-pereira.pdf: ⁠, Luisa Pereira, Leon Mutesa, Paulina Tindana, Michele Ramsay (2021-01-11; genetics  /​ ​​ ​selection):

    The deep evolutionary history of African populations, since the emergence of modern humans more than 300,000 years ago, has resulted in high genetic diversity and considerable population structure. Selected genetic variants have increased in frequency due to environmental adaptation, but recent exposures to novel pathogens and changes in lifestyle render some of them with properties leading to present health liabilities. The unique discoverability potential from African genomic studies promises invaluable contributions to understanding the genomic and molecular basis of health and disease. Globally, African populations are understudied, and precision medicine approaches are largely based on data from European and Asian-ancestry populations, which limits the transferability of findings to the continent of Africa. Africa needs innovative precision medicine solutions based on African data that use knowledge and implementation strategies aligned to its climatic, cultural, economic and genomic diversity.

  68. ⁠, Mary Lauren Benton, Abin Abraham, Abigail L. LaBella, Patrick Abbot, Antonis Rokas, John A. Capra (2021-01-06):

    Nearly all genetic variants that influence disease risk have human-specific origins; however, the systems they influence have ancient roots that often trace back to evolutionary events long before the origin of humans. Here, we review how advances in our understanding of the genetic architectures of diseases, recent human evolution and deep evolutionary history can help explain how and why humans in modern environments become ill.

    Human populations exhibit differences in the prevalence of many common and rare genetic diseases. These differences are largely the result of the diverse environmental, cultural, demographic and genetic histories of modern human populations. Synthesizing our growing knowledge of evolutionary history with genetic medicine, while accounting for environmental and social factors, will help to achieve the promise of personalized genomics and realize the potential hidden in an individual’s DNA sequence to guide clinical decisions.

    In short, precision medicine is fundamentally evolutionary medicine, and integration of evolutionary perspectives into the clinic will support the realization of its full potential.

  69. ⁠, Stephanie M. Yan, Rachel M. Sherman, Dylan J. Taylor, Divya R. Nair, Andrew N. Bortvin, Michael C. Schatz, Rajiv C. McCoy (2021-01-26):

    Large genomic insertions, deletions, and inversions are a potent source of functional and fitness-altering variation, but are challenging to resolve with short-read DNA sequencing alone. While recent long-read sequencing technologies have greatly expanded the catalog of structural variants (SVs), their costs have so far precluded their application at population scales. Given these limitations, the role of SVs in human adaptation remains poorly characterized.

    Here, we used a graph-based approach to genotype 107,866 long-read-discovered SVs in short-read sequencing data from diverse human populations. We then applied an admixture-aware method to scan these SVs for patterns of population-specific frequency differentiation—a signature of local adaptation.

    We identified 220 SVs exhibiting extreme frequency differentiation, including several SVs that were among the lead variants at their corresponding loci. The top two signatures traced to separate insertion and deletion polymorphisms at the immunoglobulin heavy chain locus, together tagging a 325 Kbp haplotype that swept to high frequency and was subsequently fragmented by recombination. Alleles defining this haplotype are nearly fixed (60–95%) in certain Southeast Asian populations, but are rare or absent from other global populations composing the 1000 Genomes Project.

    Further investigation revealed that the haplotype closely matches with sequences observed in two of three high-coverage Neanderthal genomes, providing strong evidence of a Neanderthal-introgressed origin. This extraordinary episode of positive selection, which we infer to have occurred between 1700 and 8400 years ago, corroborates the role of immune-related genes as prominent targets of adaptive archaic introgression.

    Our study demonstrates how combining recent advances in genome sequencing, genotyping algorithms, and methods can reveal signatures of key evolutionary events that remained hidden within poorly resolved regions of the genome.

  70. ⁠, Isain Zapata, Erin E. Hecht, James A. Serpell, Carlos E. Alvarez (2021-01-01):

    Genetic studies show a general factor associated with all human psychopathology and strongly correlated with personality and intelligence, but its basis is unknown. We performed genome scans of 17 normal and problem behaviors in three multi-breed dog cohorts. 21 of 90 mapped loci were supported for the same, or a related, trait in a second cohort. Several of those loci were also associated with brain structure differences across breeds; and six of the respective top-candidate genes are also associated with human brain structure and function. More broadly, the gene set of canine behavioral scans is supported by enrichment for genes mapped for human behavior, personality, cognition, psychopathology and brain structure. The biology implicated includes neurogenesis, axon guidance, angiogenesis, brain structure, alternative splicing, disease association, Hox-family transcription factors, and subiculum expression. Because body size and behavior are correlated in dogs, we isolated the effect of body size in the dog mapping and in the comparative human UK Biobank analyses. Our dog findings are consistent with pleiotropy of diverse brain traits with energy metabolism and growth, and suggest behavioral variations often affect neurogenesis. There is support for such pleiotropy in humans and well-powered genetic studies of human psychiatric traits consistently implicate neurogenesis. We propose that a genetic network which underlies neuron birth and development throughout life is associated with evolutionary adaptation of behavior and the general psychopathology factor. This understanding has implications for genetic and environmental contributions to psychiatric disease. We discuss how canine translational models can further accelerate the study of psychopathology.

    Author summary

    We genetically mapped diverse normal and problem behaviors in dogs. The well-established approach we used is ideally suited for finding variation that is common across dog breeds and for pin-pointing the most likely gene candidates. Our analysis of the genes implicated at 90 genome regions shows they are enriched for i) genes mapped for diverse brain functions and pathologies in humans; ii) genes involved in brain development throughout life; and iii) footprints of evolution in dogs, humans and other animals. We propose that this is consistent with evolutionary conservation of the general genetic factor of mental health in humans, which is correlated with personality and intelligence. The implications are that this super-network of genes is preferentially targeted by evolutionary adaptation for behavior and that its dysregulation increases risk of mental health disorders.

  71. ⁠, David Hugh-Jones, Abdel Abdellaoui (2021; economics⁠, genetics  /​ ​​ ​selection  /​ ​​ ​dysgenics):

    Natural selection has been documented in contemporary humans, but little is known about the mechanisms behind it. We test for natural selection through the association between 33 polygenic scores and fertility, across two generations, using data from UK Biobank (n = 409,629 British subjects with European ancestry).

    Consistently over time, polygenic scores associated with lower (higher) earnings, education and health are selected for (against). Selection effects are concentrated among lower SES groups, younger parents, people with more lifetime sexual partners, and people not living with a partner. The direction of natural selection is reversed among older parents (22+), or after controlling for age at first live birth. These patterns are in line with economic theories of fertility, in which higher earnings may either increase or decrease fertility via income and substitution effects in the labour market.

    Studying natural selection can help us understand the genetic architecture of health outcomes: we find evidence in modern day Great Britain for multiple natural selection pressures that vary between subgroups in the direction and strength of their effects, that are strongly related to the socio-economic system, and that may contribute to health inequalities across income groups.

    Figure 1: Mean polygenic scores (PGS) by birth year in UK Biobank. Points are means for 5-year intervals. Lines are 95% confidence intervals. Green triangles show a statistically-significant linear increase over time (p < (0.05/​​​​33)). Red squares show a statistically-significant decrease.
    Figure 7: Mean polygenic score for educational attainment (EA3) of children by household income group. Blue is actual. Grey is hypothetical in the absence of selection effects.
  72. ⁠, Elizabeth A. Landis, Angela M. Oliverio, Erin A. McKenney, Lauren M. Nichols, Nicole Kfoury, Megan Biango-Daniels, Leonora K. Shell, Anne A. Madden, Lori Shapiro, Shravya Sakunala, Kinsey Drake, Albert Robbat, Matthew Booker, Robert R. Dunn, Noah Fierer, Benjamin E. Wolfe (2021-01-26):

    Humans have relied on sourdough starter microbial communities to make leavened bread for thousands of years, but only a small fraction of global sourdough biodiversity has been characterized. Working with a community-scientist network of bread bakers, we determined the microbial diversity of 500 sourdough starters from four continents. In sharp contrast with widespread assumptions, we found little evidence for biogeographic patterns in starter communities. Strong co-occurrence patterns observed in situ and recreated in vitro demonstrate that microbial interactions shape sourdough community structure. Variation in dough rise rates and aromas were largely explained by acetic acid bacteria, a mostly overlooked group of sourdough microbes. Our study reveals the extent of microbial diversity in an ancient fermented food across diverse cultural and geographic backgrounds.

    eLife digest: Sourdough bread is an ancient fermented food that has sustained humans around the world for thousands of years. It is made from a sourdough ‘starter culture’ which is maintained, portioned, and shared among bread bakers around the world. The starter culture contains a community of microbes made up of yeasts and bacteria, which ferment the carbohydrates in flour and produce the carbon dioxide gas that makes the bread dough rise before baking.

    The different acids and enzymes produced by the microbial culture affect the bread’s flavor, texture and shelf life. However, for such a dependable staple, sourdough bread cultures and the mixture of microbes they contain have scarcely been characterized. Previous studies have looked at the composition of starter cultures from regions within Europe. But there has never been a comprehensive study of how the microbial diversity of sourdough starters varies across and between continents.

    To investigate this, Landis, Oliverio et al. used genetic sequencing to characterize the microbial communities of sourdough starters from the homes of 500 bread bakers in North America, Europe and Australasia. Bread makers often think their bread’s unique qualities are due to the local environment where the sourdough starter was made. However, Landis, Oliverio et al. found that geographical location did not correlate with the diversity of the starter cultures studied. The data revealed that a group of microbes called acetic acid bacteria, which had been overlooked in past research, were relatively common in starter cultures. Moreover, starters with a greater abundance of this group of bacteria produced bread with a strong vinegar aroma and caused dough to rise at a slower rate.

    This research demonstrates which species of bacteria and yeast are most commonly found in sourdough starters, and suggests geographical location has little influence on the microbial diversity of these cultures. Instead, the diversity of microbes likely depends more on how the starter culture was made and how it is maintained over time.

  73. 2021-koblan.pdf: ⁠, Luke W. Koblan, Michael R. Erdos, Christopher Wilson, Wayne A. Cabral, Jonathan M. Levy, Zheng-Mei Xiong, Urraca L. Tavarez, Lindsay M. Davison, Yantenew G. Gete, Xiaojing Mao, Gregory A. Newby, Sean P. Doherty, Narisu Narisu, Quanhu Sheng, Chad Krilow, Charles Y. Lin, Leslie B. Gordon, Kan Cao, Francis S. Collins, Jonathan D. Brown, David R. Liu (2021-01-06; genetics  /​ ​​ ​editing):

    Hutchinson-Gilford progeria syndrome (HGPS or progeria) is typically caused by a dominant-negative C•G-to-T•A mutation (c.1824 C>T; p.G608G) in LMNA, the gene that encodes nuclear lamin A. This mutation causes RNA mis-splicing that produces progerin, a toxic protein that induces rapid ageing and shortens the lifespan of children with progeria to approximately 14 years. Adenine base editors (ABEs) convert targeted A•T base pairs to G•C base pairs with minimal by-products and without requiring double-strand DNA breaks or donor DNA templates. Here we describe the use of an ABE to directly correct the pathogenic HGPS mutation in cultured fibroblasts derived from children with progeria and in a mouse model of HGPS. Lentiviral delivery of the ABE to fibroblasts from children with HGPS resulted in 87–91% correction of the pathogenic allele, mitigation of RNA mis-splicing, reduced levels of progerin and correction of nuclear abnormalities. Unbiased off-target DNA and RNA editing analysis did not detect off-target editing in treated patient-derived fibroblasts. In transgenic mice that are homozygous for the human LMNA c.1824 C>T allele, a single retro-orbital injection of adeno-associated virus 9 (AAV9) encoding the ABE resulted in substantial, durable correction of the pathogenic mutation (around 20–60% across various organs six months after injection), restoration of normal RNA splicing and reduction of progerin protein levels. In vivo base editing rescued the vascular pathology of the mice, preserving vascular smooth muscle cell counts and preventing adventitial fibrosis. A single injection of ABE-expressing AAV9 at postnatal day 14 improved vitality and greatly extended the median lifespan of the mice from 215 to 510 days. These findings demonstrate the potential of in vivo base editing as a possible treatment for HGPS and other genetic diseases by directly correcting their root cause.

  74. ⁠, Timothy G. Raben, Louis Lello, Erik Widen, Stephen D. H. Hsu (2021-01-14):

    Decoding the genome confers the capability to predict characteristics of the organism (phenotype) from DNA (genotype). We describe the present status and future prospects of genomic prediction of complex traits in humans. Some highly heritable complex phenotypes such as height and other quantitative traits can already be predicted with reasonable accuracy from DNA alone. For many diseases, including important common conditions such as coronary artery disease, breast cancer, type I and II diabetes, individuals with outlier polygenic scores (e.g., top few percent) have been shown to have 5 or even 10 times higher risk than average. Several psychiatric conditions such as schizophrenia and autism also fall into this category. We discuss related topics such as the genetic architecture of complex traits, sibling validation of polygenic scores, and applications to adult health, in vitro fertilization (embryo selection), and genetic engineering.
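
    [To make the “top few percent have 5–10× risk” claim concrete, a toy liability-threshold calculation (not from the paper; the prevalence and the variance explained by the PGS are assumed for illustration):]

    ```python
    # Toy illustration (not from the paper) of why top-percentile polygenic scores
    # can imply several-fold risk increases under a liability-threshold model:
    # liability = PGS + residual, disease occurs above a threshold set by prevalence.
    import numpy as np
    from scipy.stats import norm

    prevalence = 0.01        # assumed population prevalence
    r2 = 0.15                # assumed liability variance explained by the PGS
    threshold = norm.isf(prevalence)                 # liability threshold

    def risk_in_pgs_band(q_lo, q_hi, r2, threshold, grid=10_000):
        """Average P(disease) for PGS values between quantiles q_lo and q_hi."""
        qs = np.linspace(q_lo, q_hi, grid)           # capped below 1 to avoid the infinite tail
        pgs = norm.ppf(qs) * np.sqrt(r2)             # PGS contribution to liability
        resid_sd = np.sqrt(1 - r2)
        return norm.sf((threshold - pgs) / resid_sd).mean()

    top2 = risk_in_pgs_band(0.98, 0.999, r2, threshold)
    print(f"Top ~2% PGS risk: {top2:.3f}  ({top2 / prevalence:.1f}x the {prevalence:.0%} prevalence)")
    ```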

  75. ⁠, Sean M. Carroll (2021-01-19):

    Effective Field Theory (EFT) is the successful paradigm underlying modern theoretical physics, including the “Core Theory” of the Standard Model of particle physics plus Einstein’s general relativity. I will argue that EFT grants us a unique insight: each EFT model comes with a built-in specification of its domain of applicability. Hence, once a model is tested within some domain (of energies and interaction strengths), we can be confident that it will continue to be accurate within that domain.

    Currently, the Core Theory has been tested in regimes that include all of the energy scales relevant to the physics of everyday life (biology, chemistry, technology, etc). Therefore, we have reason to be confident that the laws of physics underlying the phenomena of everyday life are completely known.

    Figure 4: Limits on a new fifth force, in terms of its strength relative to gravity, as a function of its range. Adapted from data collected in Adelberger et al 2009. This is a rough reconstruction; see original source for details.
  76. 2009-adelberger.pdf: ⁠, E. G. Adelberger, J. H. Gundlach, B. R. Heckel, S. Hoedl, S. Schlamminger (2009; science):

    We review recent mechanical experiments that test some of the most basic principles of physics including the weak and strong forms of the ⁠. The very high sensitivity of these tests allows one to place interesting constraints on string-theory inspired conjectures about new forces from the exchange of very light scalar, pseudo-scalar or vector particles, large extra dimensions, the ⁠, non-commutative spacetime geometry, and Planck-scale Lorentz violation.

    [Keywords: ⁠, equivalence principle, inverse square law, Lorentz invariance]
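
    [The fifth-force limits plotted in Carroll’s Figure 4 and reviewed here are conventionally expressed as a Yukawa modification of Newtonian gravity, with α the new force’s strength relative to gravity and λ its range; the standard parametrization:]

    ```latex
    % Standard Yukawa parametrization used in fifth-force searches:
    % alpha = strength relative to gravity, lambda = range of the new interaction.
    V(r) = -\frac{G\, m_1 m_2}{r}\left[\, 1 + \alpha\, e^{-r/\lambda} \,\right]
    ```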

  77. ⁠, V Pavlovic, T. Weissgerber, D. Stanisavljevic, T. Pekmezovic, V. Garovic, N. Milic, CITE Investigators (2020-12-10):

    Citations are an important, but often overlooked, part of every scientific paper. They allow the reader to trace the flow of evidence, serving as a gateway to relevant literature. Most scientists are aware of citation errors, but few appreciate the prevalence or consequences of these problems. The purpose of this study was to examine how often frequently-cited papers in the biomedical literature are cited inaccurately. The study relied on the active participation of the first authors of frequently-cited papers, who verified citation accuracy first-hand. The approach was to identify the most-cited original articles and their parent authors, who could access, identify, collect, and review all citations of their original work. Findings from a feasibility study, in which we collected and reviewed 1,540 articles containing 2,526 citations of the 14 most-cited articles whose first authors were affiliated with the Faculty of Medicine, University of Belgrade, were further evaluated for external confirmation in an independent verification set of articles. The verification set included 4,912 citations identified in 2,995 articles that cited the 13 most-cited articles published by authors affiliated with the Mayo Clinic Division of Nephrology and Hypertension (Rochester, Minnesota, USA), whose research focus is hypertension and peripheral vascular disease. Most-cited articles and their citations were determined according to a SCOPUS database search. A citation was defined as accurate if the cited article supported, or was in accordance with, the statement by the citing authors. A multilevel regression model for binary data was used to determine predictors of inaccurate citations. At least one inaccurate citation was found in 11% and 15% of articles in the feasibility study and verification set, respectively, suggesting that inaccurate citations are common in the biomedical literature. The main findings were similar in both sets. The most common problem was the citation of nonexistent findings (38.4%), followed by incorrect interpretation of findings (15.4%). One fifth of inaccurate citations were due to “chains of inaccurate citations”, in which inaccurate citations appeared to have been copied from previous papers. Reviews, longer time elapsed from publication to citation, and multiple citations were associated with a higher chance of a citation being inaccurate. Based on these findings, several actions that authors, mentors, and journals can take to reduce citation inaccuracies and maintain the integrity of the scientific literature are proposed.

  78. ⁠, Erik D. Demaine, Jayson Lynch, Geronimo J. Mirano, Nirvan Tyagi (2016-05-26):

    We initiate the systematic study of the energy complexity of algorithms (in addition to time and space complexity) based on Landauer’s Principle in physics, which gives a lower bound on the amount of energy a system must dissipate if it destroys information. We propose energy-aware variations of three standard models of computation: circuit RAM, word RAM, and transdichotomous RAM. On top of these models, we build familiar high-level primitives such as control logic, memory allocation, and garbage collection with zero energy complexity and only constant-factor overheads in space and time complexity, enabling simple expression of energy-efficient algorithms. We analyze several classic algorithms in our models and develop low-energy variations: comparison sort, insertion sort, counting sort, breadth-first search, Bellman-Ford, Floyd-Warshall, matrix all-pairs shortest paths, AVL trees, binary heaps, and dynamic arrays. We explore the time/​​​​space/​​​​energy trade-off and develop several general techniques for analyzing algorithms and reducing their energy complexity. These results lay a theoretical foundation for a new field of semi-reversible computing and provide a new framework for the investigation of algorithms.
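
    [Landauer’s Principle, the physical bound the paper builds on, states that irreversibly erasing one bit dissipates at least k~B~·T·ln 2 of energy; a quick back-of-the-envelope sketch at an assumed room temperature, to make the scale concrete:]

    ```python
    # Landauer's principle: erasing one bit of information dissipates at least
    # k_B * T * ln(2) of energy. Back-of-the-envelope at an assumed room temperature.
    import math

    k_B = 1.380649e-23     # Boltzmann constant, J/K (exact SI value)
    T = 300.0              # assumed ambient temperature, K

    per_bit = k_B * T * math.log(2)
    print(f"Landauer bound at {T:.0f} K: {per_bit:.3e} J per erased bit")
    print(f"Irreversibly erasing 1 GB (8e9 bits): {per_bit * 8e9:.3e} J")
    ```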

  79. 2006-smith.pdf: ⁠, James E. Smith, Robert L. Winkler (2006-03-01; statistics  /​ ​​ ​decision):

    Decision analysis produces measures of value such as expected net present values or expected utilities and ranks alternatives by these value estimates. Other optimization-based processes operate in a similar manner. With uncertainty and limited resources, an analysis is never perfect, so these value estimates are subject to error. We show that if we take these value estimates at face value and select accordingly, we should expect the value of the chosen alternative to be less than its estimate, even if the value estimates are unbiased. Thus, when comparing actual outcomes to value estimates, we should expect to be disappointed on average, not because of any inherent bias in the estimates themselves, but because of the optimization-based selection process. We call this phenomenon the optimizer’s curse and argue that it is not well understood or appreciated in the decision analysis and management science communities. This curse may be a factor in creating skepticism in decision makers who review the results of an analysis. In this paper, we study the optimizer’s curse and show that the resulting expected disappointment may be substantial. We then propose the use of Bayesian methods to adjust value estimates. These Bayesian methods can be viewed as disciplined skepticism and provide a method for avoiding this postdecision disappointment.
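
    [A minimal simulation sketch, mine and under simple normal assumptions, of the optimizer’s curse and the Bayesian-shrinkage fix: value estimates are unbiased, but the estimate of the alternative you pick (the maximum) overstates its true value; shrinking each estimate toward the prior mean removes the post-decision disappointment.]

    ```python
    # Optimizer's curse demo: unbiased noisy estimates, choose the max, and the
    # chosen option's estimate is biased upward; posterior-mean shrinkage fixes it.
    import numpy as np

    rng = np.random.default_rng(1)
    n_trials, n_alternatives = 20_000, 10
    prior_mean, prior_sd, noise_sd = 0.0, 1.0, 1.0

    true_values = rng.normal(prior_mean, prior_sd, (n_trials, n_alternatives))
    estimates = true_values + rng.normal(0, noise_sd, (n_trials, n_alternatives))

    rows = np.arange(n_trials)
    chosen = estimates.argmax(axis=1)
    naive_gap = (estimates[rows, chosen] - true_values[rows, chosen]).mean()

    # Shrink each estimate toward the prior mean by the usual normal-normal factor:
    shrink = prior_sd**2 / (prior_sd**2 + noise_sd**2)
    posterior = prior_mean + shrink * (estimates - prior_mean)
    chosen_b = posterior.argmax(axis=1)
    bayes_gap = (posterior[rows, chosen_b] - true_values[rows, chosen_b]).mean()

    print(f"Naive estimate of the chosen option overstates its value by {naive_gap:+.3f}")
    print(f"Shrunken (posterior-mean) estimate is nearly unbiased: {bayes_gap:+.3f}")
    ```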

  80. Regression

  81. ⁠, Xavier Marquez (2020-08-13; sociology  /​ ​​ ​preference-falsification):

    This chapter argues that leader personality cults are typically produced by a specific set of mechanisms of flattery inflation. It describes how loyalty signaling, emotional amplification, and direct production mechanisms can combine, under specific circumstances, to transform ordinary flattery into full-blown practices of ruler worship. And it argues for attending to the specific conditions that make possible the operation of these mechanisms, showing how patronage relationships in particular provide fertile ground for the emergence of personality cults. Moreover, the chapter argues that both ancient and modern leader cults depend on similar mechanisms, despite clear differences in context and function. I illustrate the operation of these mechanisms with many modern examples and an extended discussion of one ancient example, the abortive cult of Caligula during the Roman Principate.

    [Keywords: personality cults, Caligula, flattery inflation, Hugo Chávez, Mao Zedong, Stalin]

  82. Abandoned-Footnotes

  83. 1999-dawson.pdf: ⁠, Lorne L. Dawson (1999-10-01; sociology):

    Almost everyone in the sociology of religion is familiar with the classic account of how religious groups respond to the failure of their prophetic pronouncements. Far fewer are aware of the many other studies of a similar nature completed over the last thirty years on an array of other new religious movements. There are intriguing variations in the observations and conclusions advanced by many of these studies, as well as some surprising commonalities. This paper offers a systematic overview of these variations and commonalities with an eye to developing a more comprehensive and critical perspective on this complex issue. An analysis is provided of the adaptive strategies of groups faced with a failure of prophecy and the conditions affecting the nature and relative success of these strategies. In the end, it is argued, the discussion would benefit from a conceptual reorientation away from the specifics of the theory of cognitive dissonance, as formulated by Festinger et al, to a broader focus on the generic processes of dissonance management in various religious and social groups.

  84. ⁠, Robin Hanson (2020-11-29):

    Just as authors focus on telling stories in familiar spaces with familiar minds, they also focus on telling stories in familiar moral universes. This effect is, if anything, even stronger than the space and mind effects, as moral colors are even more central to our need for stories. Compared to other areas of our lives, we especially want our stories to help us examine and affirm our moral stances…These are the familiar sorts of “moral ambiguity” in stories said to have that feature, such as The Sopranos or Game of Thrones. But you’ll note that these are almost all stories told in familiar moral universes. By which I mean that we are quite familiar with how to morally evaluate the sort of actions that happen there. The set of acts is familiar, as are their consequences, and the moral calculus used to judge them.

    But there is another sort of “moral ambiguity” that reader/​​​​viewers hate, and so authors studiously avoid. And that is worlds where we find it hard to judge the morality of actions, even when those actions have big consequences for characters. Where our usual quick and dirty moral language doesn’t apply very well. Where even though in principle our most basic and general moral languages might be able to work out rough descriptions and evaluations, in practice that would be tedious and unsatisfying.

    And, strikingly, the large complex social structures and organizations that dominate our world are mostly not familiar moral universes to most of us. For example, big firms, agencies, and markets. The worlds of Moral Mazes and of Pfeffer’s Power⁠. (In fiction: ⁠.) Our stories thus tend to avoid such contexts, unless they happen to allow an especially clear moral calculus. Such as a firm polluting to cause cancer, or a boss sexually harassing a subordinate.

    This is why our stories tend to take place in relatively old fashioned social worlds. Consider the popularity of the Western, or of pop science fiction stories like Star Wars that are essentially Westerns with more gadgets. Stories that take place in modern settings tend to focus on personal, romantic, and family relations, as these remain to us relatively familiar moral universes. Or on artist biopics. Or on big conflicts like war or corrupt police or politicians. For which we have comfortable moral framings. Stories we write today set in say the 1920s feel to us more comfortable than do stories set in the 2020s, or than stories written in the 1920s and set in that time. That is because stories written today can inherit a century of efforts to work out clearer moral stances on which 1920s actions would be more moral. For example, as to our eyes female suffrage is clearly good, we can see any characters from then who doubted it as clearly evil in the eyes of good characters. As clear as if they tortured kittens. To our eyes, their world has now clearer moral colors, and stories set there work better as stories for us.

    …This highlights an important feature of our modern world, and an important process that continues within it. Our social world has changed a lot faster than has our shared moral evaluations of typical actions possible in our new world. And our telling stories, and coming to agree on which stories we embrace, is a big part of creating such a fluid language of shared moral evaluations.

    This helps to explain why we invest so much time and energy into fiction, far more than did any of our ancestors. Why story tellers are given high and activist-like status, and why we fight so much to convince others to share our beliefs on which stories are best.

  85. https://astralcodexten.substack.com/p/still-alive

  86. ⁠, Mayank Agrawal, Marcelo G. Mattar, Jonathan D. Cohen, Nathaniel D. Daw (2020-09-09):

    Cognitive fatigue and boredom are two phenomenological states widely associated with limitations in cognitive control. In this paper, we present a rational analysis of the temporal structure of controlled behavior, which provides a new framework for providing a formal account of these phenomena. We suggest that in controlling behavior, the brain faces competing behavioral and computational imperatives, and must balance them by tracking their opportunity costs over time. We use this analysis to flesh out previous suggestions that feelings associated with subjective effort, like cognitive fatigue and boredom, are the phenomenological counterparts of these opportunity cost measures, rather than reflecting the depletion of resources as has often been assumed. Specifically, we propose that both fatigue and boredom reflect the competing value of particular options that require foregoing immediate reward but can improve future performance: Fatigue reflects the value of offline computation (internal to the organism) to improve future decisions, while boredom signals the value of exploratory actions (external in the world) to gather information. We demonstrate that these accounts provide a mechanistically explicit and parsimonious account for a wide array of findings related to cognitive control, integrating and reimagining them under a single, formally rigorous framework.

  87. ⁠, Baptiste Couvy-Duchesne, Lachlan T. Strike, Futao Zhang, Yan Holtz, Zhili Zheng, Kathryn E. Kemper, Loic Yengo, Olivier Colliot, Margaret J. Wright, Naomi R. Wray, Jian Yang, Peter M. Visscher (2020-07-20):

    The recent availability of large-scale neuroimaging cohorts facilitates deeper characterisation of the relationship between phenotypic and brain architecture variation in humans. Here, we investigate the association (previously coined morphometricity) of a phenotype with all 652,283 vertex-wise measures of cortical and subcortical morphology in a large data set from the UK Biobank (UKB; n = 9,497 for discovery, n = 4,323 for replication) and the Human Connectome Project (n = 1,110).

    We used a linear mixed model (LMM) with the brain measures of individuals fitted as random effects with covariance relationships estimated from the imaging data. We tested 167 behavioural, cognitive, psychiatric or lifestyle phenotypes and found statistically-significant morphometricity for 58 phenotypes (spanning substance use, blood assay results, education or income level, diet, depression, and cognition domains), 23 of which replicated in the UKB replication set or the HCP. We then extended the model for a bivariate analysis to estimate grey-matter correlation between phenotypes, which revealed that body size (ie., height, weight, BMI, waist and hip circumference, body fat percentage) could account for a substantial proportion of the morphometricity (confirmed using a conditional analysis), providing possible insight into previous MRI results for psychiatric disorders where case status is associated with body mass index. Our LMM framework also allowed us to predict some of the associated phenotypes from the vertex-wise measures, in two independent samples. Finally, we demonstrated additional new applications of our approach: (a) region of interest (ROI) analysis that retains the vertex-wise complexity; (b) comparison of the information retained by different MRI processing pipelines.
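
    [An illustrative sketch, not the authors’ REML pipeline, of the idea behind “morphometricity”: how much phenotypic variance a brain-similarity matrix built from standardized vertex-wise measures can account for, here estimated with a simple Haseman–Elston-style moment regression on simulated data.]

    ```python
    # Illustrative moment estimator of "morphometricity" under y = g + e with
    # cov(g) = sigma_b^2 * B, where B is a brain-relatedness matrix built from
    # standardized vertex-wise measures. Not the authors' REML software.
    import numpy as np

    def morphometricity_he(vertex_data, y):
        """vertex_data: n x p vertex-wise measures; y: phenotype of length n."""
        X = (vertex_data - vertex_data.mean(0)) / vertex_data.std(0)
        B = X @ X.T / X.shape[1]                    # brain-relatedness matrix
        y = (y - y.mean()) / y.std()
        n = len(y)
        off = ~np.eye(n, dtype=bool)
        # Regress cross-products y_i*y_j on B_ij over the off-diagonal pairs:
        b_od, yy_od = B[off], np.outer(y, y)[off]
        slope = (b_od * yy_od).sum() / (b_od ** 2).sum()
        return np.clip(slope, 0, 1)                 # fraction of variance tied to B

    # Toy data: phenotype with ~40% of variance tied to the measured morphology.
    rng = np.random.default_rng(2)
    n, p = 800, 2000
    X = rng.standard_normal((n, p))
    y = X @ rng.standard_normal(p) * np.sqrt(0.4 / p) + rng.standard_normal(n) * np.sqrt(0.6)
    print(f"Estimated morphometricity: {morphometricity_he(X, y):.2f}")
    ```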

  88. Variance-components

  89. 2018-fassnidge.pdf: ⁠, Christopher J. Fassnidge, Elliot D. Freeman (2018-06-01; psychology):

    Some people hear what they see: car indicator lights, flashing neon shop signs, and people’s movements as they walk may all trigger an auditory sensation, which we call the visual-evoked auditory response (vEAR or ‘visual ear’). We have conducted the first large-scale online survey (n > 4000) of this little-known phenomenon. We analysed the prevalence of vEAR, what induces it, and what other traits are associated with it.

    We assessed prevalence by asking whether respondents had previously experienced vEAR. Participants then rated silent videos for vividness of evoked auditory sensations, and answered additional trait questions.

    Prevalence appeared higher relative to other typical synaesthesias. Prior awareness and video ratings were associated with greater frequency of other synaesthesias, including flashes evoked by sounds, and musical imagery. Higher-rated videos often depicted meaningful events that predicted sounds (eg., collisions). However, even videos containing abstract flickering or moving patterns could also elicit higher ratings, despite having no predictable association with sounds. Such videos had higher levels of raw ‘motion energy’ (ME), which we quantified using a simple computational model of motion processing in early visual cortex. Critically, only respondents reporting prior awareness of vEAR tended to show a positive correlation between video ratings and ME.

    This specific sensitivity to ME suggests that in vEAR, signals from visual motion processing may affect audition relatively directly without requiring higher-level interpretative processes. Our other findings challenge the popular assumption that individuals with synaesthesia are rare and have idiosyncratic patterns of brain hyper-connectivity. Instead, our findings of apparently high prevalence and broad associations with other synaesthesias and traits are jointly consistent with a common dependence on normal variations in physiological mechanisms of disinhibition or excitability of sensory brain areas and their functional connectivity. The prevalence of vEAR makes it easier to test such hypotheses further, and makes the results more relevant to understanding not only synaesthetic anomalies but also normal perception.

    [Keywords: individual differences, audiovisual perception, synaesthesia]
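
    [The “motion energy” measure above comes from a model of early visual cortex; as a much cruder stand-in, one can score a silent video by its mean squared frame-to-frame intensity change, which captures the basic intuition that flickering or moving patterns carry more temporal energy. The frame shapes below are arbitrary.]

    ```python
    # Crude stand-in (not the paper's early-visual-cortex model) for the 'motion
    # energy' of a silent video: mean squared intensity change between frames.
    import numpy as np

    def motion_energy(frames: np.ndarray) -> float:
        """frames: (n_frames, height, width); returns mean squared temporal change."""
        diffs = np.diff(frames.astype(float), axis=0)
        return float((diffs ** 2).mean())

    # Toy check: a drifting bar yields more "motion energy" than a static one.
    t, h, w = 60, 32, 32
    bar = np.zeros((h, w)); bar[:, 10:14] = 1.0
    static = np.stack([bar] * t)
    drifting = np.stack([np.roll(bar, i, axis=1) for i in range(t)])
    print(f"static: {motion_energy(static):.4f}  drifting: {motion_energy(drifting):.4f}")
    ```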

  90. ⁠, Melissa Saenz, Christof Koch (2008-08-05):

    Synaesthesia is a benign neurological condition in humans characterized by involuntary cross-activation of the senses, and estimated to affect at least 1% of the population. Multiple forms of synaesthesia exist, including distinct visual, tactile or gustatory perceptions which are automatically triggered by a stimulus with different sensory properties, such as seeing colors when hearing music.

    Surprisingly, there has been no previous report of synaesthetic sound perception. Here we report that auditory synaesthesia does indeed exist with evidence from 4 healthy adults for whom seeing visual flashes or visual motion automatically causes the perception of sound. As an objective test, we show that ‘hearing-motion synesthetes’ outperformed normal control subjects on an otherwise difficult visual task involving rhythmic temporal patterns similar to Morse code. Synesthetes had an advantage because they could not only see, but also hear, the rhythmic visual patterns.

    Hearing-motion synaesthesia could be a useful tool for studying how the auditory and visual processing systems interact in the brain.

  91. 2017-fassnidge.pdf: ⁠, Christopher Fassnidge, Claudia Cecconi Marcotti, Elliot Freeman (2017-03-01; psychology):

    • Some people claim to hear what they see: a visually-evoked auditory response (V-EAR).
    • We assess the prevalence and perceptual reality of V-EAR for the first time.
    • 22% of subjects confirmed they heard faint sounds accompanying silent visual flashes.
    • V-EAR is perceptually real enough to interfere with detection of real sounds.
    • V-EAR may be a normally-occurring precursor to visual-to-auditory synaesthesia.

    In some people, visual stimulation evokes auditory sensations. How prevalent and how perceptually real is this?

    22% of our neurotypical adult participants responded ‘Yes’ when asked whether they heard faint sounds accompanying flash stimuli, and showed statistically-significantly better ability to discriminate visual ‘Morse-code’ sequences. This benefit might arise from an ability to recode visual signals as sounds, thus taking advantage of superior temporal acuity of audition. In support of this, those who showed better visual relative to auditory sequence discrimination also had poorer auditory detection in the presence of uninformative visual flashes, though this was independent of awareness of visually-evoked sounds. Thus a visually-evoked auditory representation may occur subliminally and disrupt detection of real auditory signals. The frequent natural correlation between visual and auditory stimuli might explain the surprising prevalence of this phenomenon. Overall, our results suggest that learned correspondences between strongly correlated modalities may provide a precursor for some synaesthetic abilities.

  92. ⁠, Cory Costello, Sanjay Srivastava, Reza Rejaie, Maureen Zalewski (2021-01-25):

    The past decade has seen rapid growth in research linking stable psychological characteristics (ie., traits) to digital records of online behavior in Online Social Networks (OSNs) like Facebook and Twitter, which has implications for basic and applied behavioral sciences. Findings indicate that a broad range of psychological characteristics can be predicted from various behavioral residue online, including language used in posts on Facebook () and Twitter (), and which pages a person ‘likes’ on Facebook (eg., Kosinski, Stillwell, & Graepel, 2013). The present study examined the extent to which the accounts a user follows on Twitter can be used to predict individual differences in self-reported anxiety, depression, post-traumatic stress, and anger. Followed accounts on Twitter offer distinct theoretical and practical advantages for researchers; they are potentially less subject to overt impression management and may better capture passive users. Using an approach designed to minimize overfitting and provide unbiased estimates of predictive accuracy, our results indicate that each of the four constructs can be predicted with modest accuracy (out-of-sample r’s of approximately 0.2). Exploratory analyses revealed that anger, but not the other constructs, was distinctly reflected in followed accounts, and there was some indication of bias in predictions for women (vs. men) but not for racial/​​​​ethnic minorities (vs. majorities). We discuss our results in light of theories linking psychological traits to behavior online, applications seeking to infer psychological characteristics from records of online behavior, and ethical issues such as algorithmic bias and users’ privacy.

    …As planned in the initial pre-registered protocol, we evaluated both selected and non-selected models in the holdout data. For our central research question, estimating how well mental health can be predicted by followed accounts, we found that the selected models achieved moderate, nontrivial accuracy for all four outcomes. For depression, the correlation between predicted and observed score was r = 0.24, for anxiety it was r = 0.20, for post-traumatic stress it was r = 0.19, and for anger it was r = 0.23. Figure 6 shows these estimates.

    To aid in interpretation, Figure 6 also shows two relevant estimates from prior work to serve as comparative benchmarks: the predictive accuracies for well-being and neuroticism from Kosinski and colleagues’ (2013) paper predicting psychological constructs from Facebook like-ties. As seen in Figure 6, the present estimates are between these two prior estimates, suggesting that Twitter friends predict mental health about as well as Facebook likes predict related constructs.

    Figure 6: Out-of-Sample Accuracy for Selected Models

    The correlations from both the selected and non-selected models are shown in Figure 7. This allows us to evaluate how effective the model-selection process was in picking the best-performing model. The selected model out-performed the eleven non-selected models for anger and post-traumatic stress, was second best for depression, and fourth best for anxiety. When one or more non-selected models outperformed the selected ones, it was by a relatively small margin, but the lowest-performing non-selected models were substantially worse than the selected ones.

    Figure 7: Out-of-sample Accuracy for Selected and Non-Selected Models
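
    [An illustrative sketch, not the authors’ pre-registered pipeline, of the general approach: predict a self-report score from a binary user-by-followed-account matrix with penalized regression and report the out-of-sample correlation r, as in the paper. The follow matrix, weights, and noise level below are simulated and arbitrary.]

    ```python
    # Predicting a self-report score from which accounts a user follows, using
    # ridge regression and out-of-sample correlation; all data are simulated.
    import numpy as np
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(3)
    n_users, n_accounts = 2000, 500
    follows = rng.binomial(1, 0.05, (n_users, n_accounts))        # simulated follow graph
    true_w = rng.standard_normal(n_accounts) * (rng.random(n_accounts) < 0.1)
    score = follows @ true_w + rng.standard_normal(n_users) * 3   # noisy self-report

    X_tr, X_te, y_tr, y_te = train_test_split(follows, score, test_size=0.3, random_state=0)
    model = RidgeCV(alphas=np.logspace(-1, 3, 20)).fit(X_tr, y_tr)
    r = np.corrcoef(model.predict(X_te), y_te)[0, 1]
    print(f"Out-of-sample r = {r:.2f}")
    ```
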
  93. Everything

  94. ⁠, Melisande Aellen, Judith M. Burkart, Redouan Bshary (2021-01-08):

    Differences in human general intelligence or reasoning ability can be quantified with the psychometric factor g, because individual performance across cognitive tasks is positively correlated. g also emerges in mammals and birds, is correlated with brain size and may similarly reflect general reasoning ability and behavioural flexibility in these species. To exclude the alternative that these positive cross-correlations may merely reflect the general biological quality of an organism or an inevitable by-product of having brains, it is paramount to provide solid evidence for the absence of g in at least some species. Here, we show that wild-caught cleaner fish ⁠, a fish species otherwise known for its highly sophisticated social behaviour, completely lacks g when tested on ecologically non-relevant tasks. Moreover, performance in these experiments was not or negatively correlated with an ecologically relevant task, and in none of the tasks did fish caught from a high population density site outperform fish from a low-density site. g is thus unlikely to be a default result of how brains are designed, and not an automatic consequence of variation in social complexity. Rather, the results may reflect that g requires a minimal brain size, and thus explain the conundrum why the average mammal or bird has a roughly 10 times larger brain relative to body size than ectotherms. Ectotherm brains and cognition may therefore be organized in fundamentally different ways compared to endotherms.
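
    [A sketch of how a general factor g is usually detected: compute the correlation matrix of performance across tasks and inspect its first principal component; a “positive manifold” of all-positive loadings with a dominant first eigenvalue indicates g, while the near-zero or mixed-sign correlations reported for the cleaner fish indicate its absence. The toy data below include a shared factor by construction.]

    ```python
    # Detecting a general factor: correlation matrix of task scores, then the
    # first principal component (loadings + share of variance). Simulated data.
    import numpy as np

    rng = np.random.default_rng(4)
    n_subjects, n_tasks = 300, 6
    g = rng.standard_normal(n_subjects)
    scores = 0.6 * g[:, None] + rng.standard_normal((n_subjects, n_tasks))  # shared factor

    R = np.corrcoef(scores, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)                      # ascending order
    loadings = eigvecs[:, -1] * np.sign(eigvecs[:, -1].sum()) # orient positively
    print(f"First eigenvalue explains {eigvals[-1] / n_tasks:.0%} of variance")
    print("Loadings:", np.round(loadings, 2))                 # all positive => positive manifold
    ```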

  95. 1990-langer.pdf: ⁠, Mark Langer (1990; anime):

    (⁠, 1941) is shown to contain two disparate animation traditions operating simultaneously within the ⁠. Sequences alternate between those presented in Disney’s West Coast style, an expression of the classic Hollywood tradition, and an imported East Coast style, which emphasized artifice, nonlinear narrative, and ⁠.

    Associated with such New York studios as Fleischer and Van Beuren, the East Coast Style in Dumbo is traced to the contributions of specific New York-trained animators, who were able to operate relatively freely due to Disney’s own lack of involvement [see ]. The sequence is analyzed as a major example of the East Coast influence in the film.

  96. 2021-asnicar.pdf: ⁠, Francesco Asnicar, Sarah E. Berry, Ana M. Valdes, Long H. Nguyen, Gianmarco Piccinno, David A. Drew, Emily Leeming, Rachel Gibson, Caroline Roy, Haya Al Khatib, Lucy Francis, Mohsen Mazidi, Olatz Mompeo, Mireia Valles-Colomer, Adrian Tett, Francesco Beghini, Leonard Dubois, Davide Bazzani, Andrew Maltez Thomas, Chloe Mirzayi, Asya Khleborodova, Sehyun Oh, Rachel Hine, Christopher Bonnett, Joan Capdevila, Serge Danzanvilliers, Francesca Giordano, Ludwig Geistlinger, Levi Waldron, Richard Davies, George Hadjigeorgiou, Jonathan Wolf, Jose M. Ordovas, Christopher Gardner, Paul W. Franks, Andrew T. Chan, Curtis Huttenhower, Tim D. Spector, Nicola Segata (2021-01-11; biology):

    The gut microbiome is shaped by diet and influences host metabolism; however, these links are complex and can be unique to each individual. We performed deep metagenomic sequencing of 1,203 gut microbiomes from 1,098 individuals enrolled in the Personalised Responses to Dietary Composition Trial (PREDICT 1) study, whose detailed long-term diet information, as well as hundreds of fasting and same-meal postprandial cardiometabolic blood marker measurements were available. We found many statistically-significant associations between microbes and specific nutrients, foods, food groups and general dietary indices, which were driven especially by the presence and diversity of healthy and plant-based foods. Microbial biomarkers of obesity were reproducible across external publicly available cohorts and in agreement with circulating blood metabolites that are indicators of cardiovascular disease risk. While some microbes, such as Prevotella copri and Blastocystis spp., were indicators of favorable postprandial glucose metabolism, overall microbiome composition was predictive for a large panel of cardiometabolic blood markers including fasting and postprandial glycemic, lipemic and inflammatory indices. The panel of intestinal species associated with healthy dietary habits overlapped with those associated with favorable cardiometabolic and postprandial markers, indicating that our large-scale resource can potentially stratify the gut microbiome into generalizable health levels in individuals without clinically manifest disease.

  97. ⁠, Mammalian Methylation Consortium, Ake T. Lu, Zhe Fei, Amin Haghani, Todd R. Robeck, Joseph A. Zoller, Caesar Z. Li, Joshua Zhang, Julia Ablaeva, Danielle M. Adams, Javier Almunia, Reza Ardehali, Adriana Arneson, C. Scott Baker, Katherine Belov, Pete Black, Daniel T. Blumstein, Eleanor K. Bors, Charles E. Breeze, Robert T. Brooke, Janine L. Brown, Alex Caulton, Julie M. Cavin, Ioulia Chatzistamou, Hao Chen, Priscila Chiavellini, Oi-Wa Choi, Shannon Clarke, Joseph DeYoung, Christopher Dold, Candice K. Emmons, Stephan Emmrich, Chris G. Faulkes, Steven H. Ferguson, Carrie J. Finno, Jean-Michel Gaillard, Eva Garde, Vadim N. Gladyshev, Vera Gorbunova, Rodolfo G. Goya, Matthew J. Grant, Erin N. Hales, M. Bradley Hanson, Martin Haulena, Andrew N. Hogan, Carolyn J. Hogg, Timothy A. Hore, Anna J. Jasinska, Gareth Jones, Eve Jourdain, Olga Kashpur, Harold Katcher, Etsuko Katsumata, Vimala Kaza, Hippokratis Kiaris, Michael S. Kobor, Pawel Kordowitzki, William R. Koski, Brenda Larison, Sang-Goo Lee, Ye C. Lee, Marianne Lehmann, Jean-Francois Lemaitre, Andrew J. Levine, Cun Li, Xinmin Li, David T. S. Lin, Nicholas Macoretta, Dewey Maddox, Craig O. Matkin, Julie A. Mattison, June Mergl, Jennifer J. Meudt, Khyobeni Mozhui, Asieh Naderi, Martina Nagy, Pritika Narayan, Peter W. Nathanielsz, Ngoc B. Nguyen, Christof Niehrs, Alexander G. Ophir, Elaine A. Ostrander, Perrie O’Tierney Ginn, Kim M. Parsons, Kimberly C. Paul, Matteo Pellegrini, Gabriela M. Pinho, Jocelyn Plassais, Natalia A. Prado, Benjamin Rey, Beate R. Ritz, Jooke Robbins, Magdalena Rodriguez, Jennifer Russell, Elena Rydkina, Lindsay L. Sailer, Adam B. Salmon, Akshay Sanghavi, Kyle M. Schachtschneider, Dennis Schmitt, Todd Schmitt, Lars Schomacher, Lawrence B. Schook, Karen E. Sears, Andrei Seluanov, Dhanansayan Shanmuganayagam, Anastasia Shindyapina, Kavita Singh, Ishani Sinha, Russel G. Snell, Elham Soltanmaohammadi, Matthew L. Spangler, Lydia Staggs, Karen J. Steinman, Victoria J. Sugrue, Balazs Szladovits, Masaki Takasugi, Emma C. Teeling, Michael J. Thompson, Bill Van Bonn, Sonja C. Vernes, Diego Villar, Harry V. Vinters, Mary C. Wallingford, Nan Wang, Robert K. Wayne, Gerald S. Wilkinson, Christopher K. Williams, Robert W. Williams, X. William Yang, Brent G. Young, Bohan Zhang, Zhihui Zhang, Peng Zhao, Yang Zhao, Joerg Zimmermann, Wanding Zhou, Jason Ernst, Ken Raj, Steve Horvath (2021-01-19):

    Aging is often perceived as a degenerative process caused by random accrual of cellular damage over time. In spite of this, age can be accurately estimated by epigenetic clocks based on DNA methylation profiles from almost any tissue of the body. Since such pan-tissue epigenetic clocks have been successfully developed for several different species, it is difficult to ignore the likelihood that a defined and shared mechanism instead underlies the aging process.

    To address this, we generated 10,000 methylation arrays, each profiling up to 37,000 cytosines in highly-conserved stretches of DNA, from over 59 tissue-types derived from 128 mammalian species. From these, we identified and characterized specific cytosines, whose methylation levels change with age across mammalian species. Genes associated with these cytosines are greatly enriched in mammalian developmental processes and implicated in age-associated diseases.

    From the methylation profiles of these age-related cytosines, we successfully constructed 3 highly accurate universal mammalian clocks for eutherians, and 1 universal clock for marsupials. The universal clocks for eutherians are similarly accurate for estimating ages (r > 0.96) of any mammalian species and tissue with a single mathematical formula.

    Collectively, these new observations support the notion that aging is indeed evolutionarily conserved and coupled to developmental processes across all mammalian species—a notion that was long-debated without the benefit of this new and compelling evidence.
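
    [The abstract does not give the fitting details, but epigenetic clocks are typically built as penalized (elastic-net) regressions of transformed age on CpG methylation fractions, evaluated by out-of-sample correlation; a minimal sketch on simulated data, not the consortium’s pipeline.]

    ```python
    # Minimal sketch of how epigenetic clocks are commonly built: elastic-net
    # regression of log-transformed age on CpG methylation. All data simulated.
    import numpy as np
    from sklearn.linear_model import ElasticNetCV
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(5)
    n_samples, n_cpgs = 600, 5000
    age = rng.uniform(0, 60, n_samples)
    drift = rng.standard_normal(n_cpgs) * (rng.random(n_cpgs) < 0.02)   # age-related CpGs
    meth = np.clip(0.5 + np.log1p(age)[:, None] * drift * 0.05
                   + rng.normal(0, 0.05, (n_samples, n_cpgs)), 0, 1)

    X_tr, X_te, y_tr, y_te = train_test_split(meth, np.log1p(age), test_size=0.3, random_state=0)
    clock = ElasticNetCV(l1_ratio=0.5, n_alphas=30, cv=5, max_iter=5000).fit(X_tr, y_tr)
    r = np.corrcoef(clock.predict(X_te), y_te)[0, 1]
    print(f"Out-of-sample age correlation r = {r:.2f}")
    ```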

  98. ⁠, Mikolaj Ogrodnik, Shane A. Evans, Edward Fielder, Stella Victorelli, Patrick Kruger, Hanna Salmonowicz, Bettina M. Weigand, Ayush D. Patel, Tamar Pirtskhalava, Christine L. Inman, Kurt O. Johnson, Stephanie L. Dickinson, Azucena Rocha, Marissa J. Schafer, Yi Zhu, David B. Allison, Thomas von Zglinicki, Nathan K. LeBrasseur, Tamar Tchkonia, Nicola Neretti, João F. Passos, James L. Kirkland, Diana Jurk (2021-01-20):

    Cellular senescence is characterized by an irreversible cell cycle arrest and a pro-inflammatory senescence-associated secretory phenotype (SASP), which is a major contributor to aging and age-related diseases. Clearance of senescent cells has been shown to improve brain function in mouse models of neurodegenerative diseases. However, it is still unknown whether senescent cell clearance alleviates cognitive dysfunction during the aging process. To investigate this, we first conducted single-nuclei and single-cell RNA-seq in the hippocampus from young and aged mice. We observed an age-dependent increase in p16Ink4a senescent cells, which was more pronounced in microglia and oligodendrocyte progenitor cells and characterized by a SASP. We then aged INK-ATTAC mice, in which p16Ink4a-positive senescent cells can be genetically eliminated upon treatment with the drug AP20187, and treated them either with AP20187 or with the senolytic cocktail of Dasatinib and Quercetin. We observed that both strategies resulted in a decrease in p16Ink4a exclusively in the microglial population, resulting in reduced microglial activation and reduced expression of SASP factors. Importantly, both approaches statistically-significantly improved cognitive function in aged mice. Our data provide proof-of-concept for senolytic interventions as a potential therapeutic avenue for alleviating age-associated cognitive impairment.

  99. ⁠, Demetres Kostas, Stephane Aroca-Ouellette, Frank Rudzicz (2021-01-28):

    Deep neural networks (DNNs) used for brain-computer-interface (BCI) classification are commonly expected to learn general features when trained across a variety of contexts, such that these features could be fine-tuned to specific contexts. While some success is found in such an approach, we suggest that this interpretation is limited and an alternative would better leverage the newly (publicly) available massive EEG datasets. We consider how to adapt techniques and architectures used for language modelling (LM), that appear capable of ingesting awesome amounts of data, towards the development of encephalography modelling (EM) with DNNs in the same vein. We specifically adapt an approach effectively used for automatic speech recognition, which similarly (to LMs) uses a self-supervised training objective to learn compressed representations of raw data signals. After adaptation to EEG, we find that a single pre-trained model is capable of modelling completely novel raw EEG sequences recorded with differing hardware, and different subjects performing different tasks. Furthermore, both the internal representations of this model and the entire architecture can be fine-tuned to a variety of downstream BCI and EEG classification tasks, outperforming prior work in more task-specific (sleep stage classification) self-supervision.
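
    [The paper adapts a wav2vec-2.0-style contrastive objective to raw EEG; as a much-simplified stand-in, the sketch below pretrains a small 1D-convolutional encoder by masking time spans and reconstructing them. The channel count, layer sizes, and reconstruction (rather than contrastive) loss are my illustrative choices, not the paper’s.]

    ```python
    # Simplified self-supervised pretraining on raw EEG: mask ~20% of time points
    # and reconstruct them from a 1D-conv encoding. Shapes are illustrative only.
    import torch
    import torch.nn as nn

    class TinyEEGModel(nn.Module):
        def __init__(self, n_channels=19, dim=64):
            super().__init__()
            self.encoder = nn.Sequential(          # 4x temporal downsampling
                nn.Conv1d(n_channels, dim, kernel_size=7, stride=2, padding=3), nn.GELU(),
                nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2), nn.GELU(),
            )
            self.decoder = nn.Sequential(          # back up to raw resolution
                nn.ConvTranspose1d(dim, dim, kernel_size=4, stride=2, padding=1), nn.GELU(),
                nn.ConvTranspose1d(dim, n_channels, kernel_size=4, stride=2, padding=1),
            )

        def forward(self, x):                      # x: (batch, channels, time)
            return self.decoder(self.encoder(x))

    model = TinyEEGModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    x = torch.randn(8, 19, 1024)                   # a batch of fake raw EEG
    mask = (torch.rand(8, 1, 1024) < 0.2).float()  # hide ~20% of time points
    recon = model(x * (1 - mask))
    loss = ((recon - x) ** 2 * mask).sum() / mask.sum()   # score only the masked samples
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"masked-reconstruction loss: {loss.item():.3f}")
    ```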

  100. ⁠, Peter Norvig (2020-12-15):

    …A few days ago I watched the “How Computers Learn” talk by Peter Norvig. In this talk, Peter talked about how Google did machine learning and at one point he mentioned that at Google they also applied machine learning to hiring. He said that one thing that was surprising to him was that being a winner at programming contests was a negative factor for performing well on the job. Peter added that programming contest winners are used to cranking solutions out fast and that you performed better at the job if you were more reflective and went slowly and made sure things were right…

    Norvig’s reply:

    I regret causing confusion here. It turns out that this correlation was true on the initial small data set, but after gathering more data, the correlation went away. So the real lesson should be: “if you gather data on a lot of low-frequency events, some of them will display a spurious correlation, about which you can make up a story.”

    [The null correlation likely reflects the usual attenuation in screening scenarios from power/​​​​n of rare traits like programming competition victories, ⁠, and ⁠.]
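
    [A small simulation of the screening story: a rare, skill-related trait (contest wins) that is rewarded at the hiring screen can show a null or negative correlation with skill among hires, via range restriction/collider bias plus the low power of a rare binary predictor. All parameters below are arbitrary.]

    ```python
    # Why contest wins can look uninformative among hires even when they predict
    # skill in the applicant pool: selection on a combined screen (collider /
    # range restriction) plus the rarity of the trait. Simulated data.
    import numpy as np

    rng = np.random.default_rng(6)
    n = 1_000_000
    skill = rng.standard_normal(n)
    contest_win = (skill + rng.standard_normal(n) * 1.5) > 3.0     # rare, skill-related trait
    interview = skill + rng.standard_normal(n)                     # other screening signal
    hired = (interview + 2.0 * contest_win) > 2.5                  # screen rewards both

    def corr(a, b):
        return np.corrcoef(a.astype(float), b)[0, 1]

    print(f"Applicant pool: corr(win, skill) = {corr(contest_win, skill):+.3f}, "
          f"base rate {contest_win.mean():.2%}")
    print(f"Among hires (n={hired.sum()}): corr(win, on-the-job skill) = "
          f"{corr(contest_win[hired], skill[hired]):+.3f}")
    ```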

  101. 2007-keeley.pdf: ⁠, Lawrence H. Keeley, Marisa Fontana, Russell Quick (2007-03-01; history):

    This article discusses several universal features of fortifications and distinguishes those features that are unequivocally military in function. The evidence adduced includes the features of known historic fortifications, relevant prescriptions by ancient military authors, and geometry. The archaeologically visible features that are universally used in military defenses are V-sectioned ⁠, “defended” (especially baffled) gates, and ⁠. It is also noted that ritual, ceremonial, or any other peaceful activities conducted within an enclosure having these architectural features do not preclude its obvious military function.

    [Keywords: ancient fortifications, warfare, prehistoric enclosures, pre-gunpowder weapons, symbolism, warfare, noble savage myth, prehistoric war, ]

    Figure 3: Schematic defensive gate plans
    Figure 5: Actual defensive gate plans. Redrawn from Andersen (1997), Barkay (1992), Barnes (1999), Cunliffe (1997), Demarest et al. (1997), Dyer (1992), Hogg (1981), Lawrence (1979), Mazar (1990), Wrightman (1985).
  102. 2020-jeremytankard-footnote-36-redisturbed.pdf: “Footnote 36: Redisturbed: In This Issue We're Focusing on the Redisturbed Typeface For The New Decade [Redisturbed is a fresh look at our original Disturbance typeface from 1993. Looking deeper at the concept of a unicase alphabet and designing it for expanded use today. More weights, optical sizes, language support and OpenType features.]”⁠, Serif Affinity Publisher (Jul 29 2020)

  103. ⁠, Brad Plumer, Christopher Flavelle (2021-01-18):

    A growing number of corporations are pouring money into so-called engineered carbon removal—for example, using giant fans to pull carbon dioxide from the air and trap it. The companies say these techniques, by offsetting emissions they can’t otherwise cut, may be the only way to fulfill lofty “net zero” pledges.

    Occidental Petroleum and United Airlines are investing in a large “direct air capture” plant in Texas that will use fans and chemical agents to scrub carbon dioxide from the sky and inject it underground. Stripe and Shopify⁠, two e-commerce companies, have each begun spending at least $1 million per year on start-ups working on carbon removal techniques, such as sequestering the gas in concrete for buildings. Microsoft will soon announce detailed plans to pay to remove one million tons of carbon dioxide.

    The United Nations-backed Intergovernmental Panel on Climate Change has said nations may need to remove between 100 billion and 1 trillion tons of carbon dioxide from the atmosphere this century to avert the worst effects of climate change—far more than can be absorbed by simply planting more trees. But many carbon removal technologies remain too expensive for widespread use, often costing $600 or more per ton of carbon. The hope, companies say, is that early investments can help drive down prices to something more palatable—say, $100 per ton or less—much as investments in wind and solar have made those energy sources cheaper over time.

    There are working prototypes of such devices. But for years, engineers developing carbon removal struggled to find investors.

    “It’s a chicken-or-egg problem”, said Nan Ransohoff, head of climate at Stripe, an online payments company based in San Francisco. “The best way to bring down the cost is to start deploying these technologies at scale. But until there are actual customers, no one’s going to build them.”

    To help break the impasse, Stripe announced in 2019 that it would begin spending at least $1 million annually on carbon removal, without worrying about the price per ton initially. The goal was to evaluate companies working on promising technologies and offer them a reliable stream of income. After convening outside experts to review applications, Stripe announced its first round of payments last May. That included an agreement with Climeworks, a Swiss start-up that has already built several small direct air capture plants in Europe. Stripe also paid $250,000 to Project Vesta⁠, a nonprofit planning to sprinkle volcanic minerals on beaches, testing to see how much carbon dioxide they absorb as the waves break them down, through a process known as enhanced weathering. The companies receiving Stripe’s funding say the money has been crucial. “It’s existential for us”, said Peter Reinhardt, co-founder of Charm Industrial, a start-up that Stripe is paying to remove 416 tons of carbon dioxide at $600 per ton. His company will take crop waste and convert it into an oil that can be injected underground, rather than letting the waste decay and release carbon back into the atmosphere.

    Other companies are similarly investing. The German automaker Audi is paying Climeworks to capture and remove 1,000 tons of carbon dioxide from a new direct air capture facility in Iceland, scheduled to come online this year. Climeworks has also signed an agreement with Swiss Re, the insurance giant, which this month created a dedicated funding stream for carbon removal. Shopify, a Canadian e-commerce company, has already committed $1.6 million to various carbon-removal start-ups. Christoph Gebald, Climeworks’ co-director, said his company now had more than 50 corporate clients paying to capture and store carbon dioxide. His goal is to build enough facilities to remove 30 million to 50 million tons a year from the atmosphere by 2030.

    It remains to be seen, however, whether carbon removal companies can lower their prices to a level that’s attractive to the average buyer. Carbon Engineering, a Canadian company supplying the technology for the direct air capture plant in Texas, thinks it can get prices down to $94 to $232 a ton.

  104. https://freakonomics.com/podcast/advertising-part-1/

  105. https://freakonomics.com/podcast/advertising-part-2/

  106. 2020-aral.pdf: ⁠, Sinan Aral, Paramveer S. Dhillon (2020-08-14; advertising⁠, economics):

    Most online content publishers have moved to subscription-based business models regulated by digital paywalls. But the managerial implications of such freemium content offerings are not well understood. We, therefore, utilized microlevel user activity data from the New York Times to conduct a large-scale study of the implications of digital paywall design for publishers. Specifically, we use a quasi-experiment that varied the (1) quantity (the number of free articles) and (2) exclusivity (the number of available sections) of free content available through the paywall to investigate the effects of paywall design on content demand, subscriptions, and total revenue.

    The paywall policy changes we studied suppressed total content demand by about 9.9%, reducing total advertising revenue. However, this decrease was more than offset by increased subscription revenue as the policy change led to a 31% increase in total subscriptions during our seven-month study, yielding net positive revenues of over $287,104$230,0002013. The results confirm an economically-significant impact of the newspaper’s paywall design on content demand, subscriptions, and net revenue. Our findings can help structure the scientific discussion about digital paywall design and help managers optimize digital paywalls to maximize readership, revenue, and profit.

    Figure 5: Average Number of Articles Read on the Browser (NumArticles^Browser^~it~) by Subscribers and Non-subscribers. ¶ Notes: (1) “High Quantity” represents access to all the published content and “Low Quantity” denotes access to 3 articles per day. Similarly, “High Diversity” represents access to all sections whereas “Low Diversity” represents access to content from only top news and video sections. (2) For simplicity of exposition, the plot only shows readers who stayed subscribers or non-subscribers throughout. (3) The fitted line in the plot is the least-squares line.
  107. Ads

  108. 2010-schuh.pdf: ⁠, Scott Schuh, Oz Shy, Joanna Stavins (2010-08-31; economics):

    Merchant fees and reward programs generate an implicit monetary transfer to credit card users from non-card (or “cash”) users because merchants generally do not set differential prices for card users to recoup the costs of fees and rewards. On average, each cash-using household pays $196$1492010 to card-using households and each card-using household receives $1,488$1,1332010 from cash users every year. Because credit card spending and rewards are positively correlated with household income, the payment instrument transfer also induces a regressive transfer from low-income to high-income households in general. On average, and after accounting for rewards paid to households by banks, the lowest-income household ($26,260$20,0002010 or less annually) pays $28$212010 and the highest-income household ($196,954$150,0002010 or more annually) receives $985$7502010 every year. We build and calibrate a model of consumer payment choice to compute the effects of merchant fees and card rewards on consumer welfare. Reducing merchant fees and card rewards would likely increase consumer welfare.

  109. 2019-quinn.pdf: ⁠, William Quinn (2019-01-29; economics):

    This article examines the extent to which Victorian investors were constrained. While previous research suggests that there were relatively few limits on arbitrage, this article argues that short-sales of stocks outside the Official List were indirectly constrained by the risk of being cornered. Evidence for this hypothesis comes from three corners in cycle company shares [during the 1890s British cycle share boom] which occurred in 1896–1897, two of which resulted in substantial losses for short-sellers. Legal efforts to retrieve funds lost in a corner were unsuccessful, and the court proceedings reveal a widespread contempt for short-sellers, or ‘bears’, among the general public. Consistent with the hypothesis that these episodes affected the market, this study’s findings show that cycle companies for which cornering risk was greater experienced disproportionately lower returns during a subsequent crash in the market for cycle shares. This evidence suggests that, under certain circumstances, short-selling shares in Britain prior to 1900 could have been much riskier than previously thought.

    …Cycle share prices are found to have risen by over 200% in the early months of 1896, and remained at a relatively high level until March 1897. This boom was accompanied by the promotion of many new cycle firms, with 363 established in 1896 and another 238 during the first half of 1897. This was followed by a crash, with cycle shares losing 76% of their peak value by the end of 1898. The financial press appears to have been aware that a crash was imminent, repeatedly advising investors to sell cycle shares during the first half of 1897. Interestingly, however, these articles never explicitly recommended short-selling cycle shares…Between 1890 and 1896, a succession of major technological innovations substantially increased the demand for British bicycles.37 Bicycle production increased in response, with the number of British cycle companies in existence quadrupling between 1889 and 1897.38 Cycle firms, most of which were based in and around Birmingham, took advantage of the boom of 1896 by going public, resulting in the successful promotion of £17.3 million worth of cycle firms in 1896 and a further £7.4 million in 1897.39 By 1897 there was an oversupply problem in the trade, which was worsened by an exponential increase in the number of bicycles imported from the US.40 The bicycle industry entered recession, and the number of Birmingham-based cycle firms fell by 54% between 1896 and 1900.41

    …The total paid for the 200 shares [by the short-trader Hamlyn] was £2,550, to be delivered at a price of £231.25, for a loss of £2,318.75. To put this loss in context, Hamlyn’s barrister noted that, had he succeeded in obtaining the shares at allotment, the profit would have been only £26.

    Figure 1: Cycle share index vs. subsequent reported dividends, 1895–1898
  110. https://www.tabletmag.com/sections/arts-letters/articles/on-venus-have-we-got-a-rabbi

  111. 2013-dubin-fabliauxtranslations-stmartinsfourwishes.pdf: “St Martin's Four Wishes”⁠, Ned Dubin