“DALL·E 1: Creating Images from Text”, OpenAI (GPT-3-12.5b generating 1280 tokens → VQ-VAE pixels; generates illustrations & photos); “CLIP (Contrastive Language-Image Pre-training): Connecting Text and Images”, OpenAI (Radford et al 2021: zero-shot image understanding via text description, useful for much more than just ranking DALL·E 1 samples by quality)
Further blessings of scale: simple contrastive training on n = 400m image–text pairs leads to remarkable generalization & combinatorial flexibility of image generation by DALL·E 1, and CLIP learns to reach image classification SOTA zero-shot on many datasets, with more human-like errors & less degradation out-of-sample than rivals, while costing the same to train. OpenAI released their smallest CLIP model (the “ViT-B/32” variant) and people are discovering it seems able to do just about anything without any further training: the paper notes that it handles everything from “fine-grained object classification, geo-localization, action recognition in videos, and OCR”, but there is so much more. You can use it to generate image captions/descriptions, classify your anime images, use gradient ascent to pull a target image matching a description out of another neural network such as an ImageNet BigGAN or TADNE StyleGAN2-ext (or, why not, synthesize images embodying abstract concepts like emoji or words like “nightmare fuel” or “confusion”!), search your image datasets by embedding, find mislabeled images (eg. by using “upside down” as the prompt)… One wonders, as with GPT-3, how much better the largest CLIP (“L/14-336px”) is, and how many ways of using it (or DALL·E 1) remain to be found. And why do prediction losses work so well in one place, but contrastive losses in another?
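Concretely, the released ViT-B/32 checkpoint can be driven through the open-source `clip` package; the sketch below (assuming the package is installed from the openai/CLIP repository, with placeholder filenames & labels) shows the two workhorse tricks behind most of these uses: zero-shot classification by comparing an image embedding against text-prompt embeddings, and ranking a whole dataset against a text query for search or mislabeled-image hunting.

```python
# Hedged sketch: zero-shot classification & prompt-based search with the
# released ViT-B/32 CLIP model, via the open-source `clip` package
# (`pip install git+https://github.com/openai/CLIP.git`).
# Filenames, labels, and the `search` helper are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_image(path):
    """L2-normalized CLIP embedding of one image file."""
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        f = model.encode_image(image)
    return f / f.norm(dim=-1, keepdim=True)

def embed_text(prompts):
    """L2-normalized CLIP embeddings of a list of text prompts."""
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        f = model.encode_text(tokens)
    return f / f.norm(dim=-1, keepdim=True)

# 1. Zero-shot classification: the predicted label is the prompt whose
#    embedding is closest (by cosine similarity) to the image embedding.
labels = ["a photo of a cat", "a photo of a dog", "an anime illustration"]
probs = (100.0 * embed_image("example.jpg") @ embed_text(labels).T).softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))

# 2. Prompt-based dataset search / mislabeled-image hunting: rank every image
#    by similarity to a text query, eg. "an upside down photo".
def search(image_paths, query, top_k=10):
    q = embed_text([query])
    scores = [(p, (embed_image(p) @ q.T).item()) for p in image_paths]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]
```

Most of the other applications (captioning, CLIP-guided BigGAN/StyleGAN image synthesis) are variations on the same move: maximize this image–text similarity, just by optimizing over inputs instead of ranking them.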
For perspective: there are newly-minted PhDs going on the job market who got excited about deep learning because of these new “resnet” things; undergrads who applied to grad school because BERT et al were blowing open NLP & extending neural supremacy to natural language would not yet have passed quals; and it has been only 1 academic semester since GPT-3 was announced. Or to put it quantitatively, for sequence modeling alone: it has been 8,478 days since LSTM RNNs were published; 3,045 days since AlexNet’s ImageNet scores were released; 1,880 days since residual networks were published; 1,330 days since “Attention Is All You Need” hit Arxiv; 844 days since BERT’s paper was published; 718 days since GPT-2 was announced; 353 days since SimCLR and 249 days since GPT-3; and 27 days since CLIP/DALL·E 1. Spring is coming. (Some still insist we need not worry about “overpopulation on Mars” for >18,264 more days…)