“DALL·E 1: Creating Images from Text”, OpenAI (GPT-3-12.5b generating 1280 tokens → VQ-VAE pixels; generates illustrations & photos); “CLIP (Contrastive Language-Image Pre-training): Connecting Text and Images”, OpenAI (Radford et al 2021: zero-shot image understanding via text description, useful for much more than just ranking DALL·E 1 samples by quality)
Further blessings of scale: simple contrastive training on n = 400m image–text pairs leads to remarkable generalization & combinatorial flexibility of image generation by DALL·E 1, and CLIP learns to reach image classification SOTA zero-shot on many datasets, with more human-like errors & less degradation out-of-sample than rivals, while costing the same to train. OpenAI released their smallest CLIP model (the “ViT-B/32” variant) and people are discovering it seems able to do just about anything without any further training: the paper notes that it handles everything from “fine-grained object classification, geo-localization, action recognition in videos, and OCR”, but there is so much more. You can use it to generate image captions/descriptions, classify your anime images, use gradient ascent to pull a target image matching a description out of another neural network such as an ImageNet BigGAN or TADNE StyleGAN2-ext (or, why not, synthesize images embodying abstract concepts like emoji or words like “nightmare fuel” or “confusion”!), search your image datasets by embedding, find mislabeled images (eg. by using “upside down” as the prompt)… One wonders, as with GPT-3, how much better the largest CLIP (“L/14-336px”) is, and how many ways of using it (or DALL·E 1) remain to be found. And why do prediction losses work so well in one place, but contrastive losses in another?
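Concretely, the released ViT-B/32 checkpoint can be driven through the open-source `clip` package; the sketch below (assuming the package is installed from the openai/CLIP repository, with placeholder filenames & labels) shows the two workhorse tricks behind most of these uses: zero-shot classification by comparing an image embedding against text-prompt embeddings, and ranking a whole dataset against a text query for search or mislabeled-image hunting.

```python
# Hedged sketch: zero-shot classification & prompt-based search with the
# released ViT-B/32 CLIP model, via the open-source `clip` package
# (`pip install git+https://github.com/openai/CLIP.git`).
# Filenames, labels, and the `search` helper are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_image(path):
    """L2-normalized CLIP embedding of one image file."""
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        f = model.encode_image(image)
    return f / f.norm(dim=-1, keepdim=True)

def embed_text(prompts):
    """L2-normalized CLIP embeddings of a list of text prompts."""
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        f = model.encode_text(tokens)
    return f / f.norm(dim=-1, keepdim=True)

# 1. Zero-shot classification: the predicted label is the prompt whose
#    embedding is closest (by cosine similarity) to the image embedding.
labels = ["a photo of a cat", "a photo of a dog", "an anime illustration"]
probs = (100.0 * embed_image("example.jpg") @ embed_text(labels).T).softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))

# 2. Prompt-based dataset search / mislabeled-image hunting: rank every image
#    by similarity to a text query, eg. "an upside down photo".
def search(image_paths, query, top_k=10):
    q = embed_text([query])
    scores = [(p, (embed_image(p) @ q.T).item()) for p in image_paths]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]
```

Most of the other applications (captioning, CLIP-guided BigGAN/StyleGAN image synthesis) are variations on the same move: maximize this image–text similarity, just by optimizing over inputs instead of ranking them.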
For perspective: there are newly-minted PhDs going on the job market who got excited about deep learning because of these new “resnet” things; undergrads who applied to grad school because BERT et al were blowing open NLP & extending neural supremacy to natural language would not yet have passed quals; and it has been only 1 academic semester since GPT-3 was announced. Or to put it quantitatively, for sequence modeling alone: it has been 8,478 days since LSTM RNNs were published; 3,045 days since AlexNet’s ImageNet scores were released; 1,880 days since residual networks were published; 1,330 days since “Attention Is All You Need” hit Arxiv; 844 days since BERT’s paper was published; 718 days since GPT-2 was announced; 353 days since SimCLR and 249 days since GPT-3; and 27 days since CLIP/DALL·E 1. Spring is coming. (Some still insist we need not worry about “overpopulation on Mars” for >18,264 more days…)