Large text-guided diffusion models, such as DALL·E 2, are able to generate stunning photorealistic images given natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, often confusing the attributes of different objects or the relations between objects.
In this paper, we propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. To do this, we interpret diffusion models as energy-based models in which the data distributions defined by the energy functions may be explicitly combined.
The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach may be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that have been shown difficult for DALLE-2.
These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation.
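The composition rule is compact enough to state in code. Below is a minimal sketch of the conjunction operator under the energy-based interpretation, assuming a hypothetical `unet(x, t, cond=...)` noise-prediction interface; the per-concept weights and embeddings are illustrative, not the paper's exact API:

```python
def composed_eps(unet, x, t, concept_embs, weights):
    """Compose per-concept noise predictions (conjunction): treat each
    conditional score as the gradient of an energy function and sum the
    conditional-minus-unconditional differences."""
    eps_uncond = unet(x, t, cond=None)          # unconditional score estimate
    eps_hat = eps_uncond.clone()
    for c, w in zip(concept_embs, weights):
        eps_cond = unet(x, t, cond=c)           # score for one concept/component
        eps_hat = eps_hat + w * (eps_cond - eps_uncond)
    return eps_hat                              # use in place of eps in the sampler
```

At each denoising step, the sampler simply substitutes `composed_eps` for the single-model prediction, which is how concepts never seen together in training can still be combined at test time.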
Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL·E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model's understanding of complex movement semantics.
In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips.
As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (eg. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL·E 2 [vice-versa], and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See homepage for an overview of the results.
[CogView1; demo? checkpoints?] The development of transformer-based text-to-image models is impeded by their slow generation and the complexity of high-resolution images.
In this work, we put forward a solution based on hierarchical transformers and local parallel auto-regressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, Cross-modal general language model (CogLM) [cf. MAE], and finetune it for fast super-resolution.
The new text-to-image system, CogView2, shows very competitive generation compared to the concurrent state-of-the-art DALL·E 2 [looks noticeably worse, IMO], and naturally supports interactive text-guided editing of images.
…Although conditioning image generation on CLIP embeddings improves diversity, this choice does come with certain limitations. In particular, unCLIP is worse at binding attributes to objects than a corresponding GLIDE model. In Figure 14, we find that unCLIP struggles more than GLIDE with a prompt where it must bind 2 separate objects (cubes) to 2 separate attributes (colors). We hypothesize that this occurs because the CLIP embedding itself does not explicitly bind attributes to objects, and find that reconstructions from the decoder often mix up attributes and objects, as shown in Figure 15.
A similar and likely related issue is that unCLIP struggles to produce coherent text, as illustrated in Figure 16; it is possible that the CLIP embedding does not precisely encode spelling information of rendered text. This issue is likely made worse because the BPE encoding we use obscures the spelling of the words in a caption from the model, so the model needs to have independently seen each token written out in the training images in order to learn to render it.
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
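The two-stage structure is simple to express. A minimal sketch, with `text_encoder`, `prior`, and `decoder` as hypothetical stand-ins for the paper's CLIP text encoder, prior, and diffusion decoder:

```python
def unclip_generate(caption, text_encoder, prior, decoder, n_samples=4):
    """unCLIP-style two-stage sampling (sketch, not the paper's exact API)."""
    text_emb = text_encoder(caption)      # CLIP text embedding of the caption
    img_emb = prior.sample(text_emb)      # stage 1: caption -> CLIP image embedding
    # stage 2: decode images conditioned on the image embedding; resampling the
    # decoder yields variations sharing the embedding's semantics and style
    return [decoder.sample(img_emb, text_emb) for _ in range(n_samples)]
```

Resampling stage 2 while holding `img_emb` fixed is what produces the "variations of an image" behavior described above.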
[video] Recent text-to-image generation methods provide a simple yet exciting conversion capability between text and image domains. While these methods have incrementally improved the generated image fidelity and text relevancy, several pivotal gaps remain unanswered, limiting applicability and quality.
We propose a novel text-to-image method that addresses these gaps by (1) enabling a simple control mechanism complementary to text in the form of a scene, (2) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (3) adapting classifier-free guidance for the transformer use case.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512×512 pixels, substantially improving visual quality.
Through scene controllability, we introduce several new capabilities: (1) Scene editing, (2) text editing with anchor scenes, (3) overcoming out-of-distribution text prompts, and (4) story illustration generation, as demonstrated in the story we wrote.
Generating images from textual descriptions has gained a lot of attention. Recently, DALL·E, a multimodal transformer language model, and its variants have shown high-quality text-to-image generation capabilities with a simple architecture and training objective, powered by large-scale training data and computation. However, despite the interesting image generation results, there has not been a detailed analysis on how to evaluate such models. In this work, we investigate the reasoning capabilities and social biases of such text-to-image generative transformers in detail. First, we measure four visual reasoning skills: object recognition, object counting, color recognition, and spatial relation understanding. For this, we propose PaintSkills, a diagnostic dataset and evaluation toolkit that measures these four visual reasoning skills. Second, we measure the text alignment and quality of the generated images based on pretrained image captioning, image-text retrieval, and image classification models. Third, we assess social biases in the models. For this, we suggest evaluation of gender and racial biases of text-to-image generation models based on a pretrained image-text retrieval model and human evaluation. In our experiments, we show that recent text-to-image models perform better in recognizing and counting objects than recognizing colors and understanding spatial relations, while there exists a large gap between model performances and oracle accuracy on all skills. Next, we demonstrate that recent text-to-image models learn specific gender/racial biases from web image-text pairs. We also show that our automatic evaluations of visual reasoning skills and gender bias are highly correlated with human judgments. We hope our work will help guide future progress in improving text-to-image models on visual reasoning skills and social biases. Code and data at: https://github.com/j-min/DallEval.
We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of their original positions. The causal masking objective provides a type of hybrid of the more common causal and masked language models, by enabling full generative modeling while also providing bidirectional context when generating the masked spans. We train causally masked language-image models on large-scale web and Wikipedia articles, where each document contains all of the text, hypertext markup, hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they appear in the original HTML source (before masking). The resulting CM3 models can generate rich structured multi-modal outputs while conditioning on arbitrary masked document contexts, and thereby implicitly learn a wide range of text, image, and cross-modal tasks. They can be prompted to recover, in a zero-shot fashion, the functionality of models such as DALL·E, GENRE, and HTLM. We set the new state-of-the-art in zero-shot summarization, entity linking, and entity disambiguation while maintaining competitive performance in the fine-tuning setting. We can generate images unconditionally, conditioned on text (like DALL·E) and do captioning all in a zero-shot setting with a single model.
The text-guided diffusion model GLIDE (Guided Language to Image Diffusion for Generation and Editing) is the state of the art in text-to-image generative artificial intelligence (AI). GLIDE has rich representations, but medical applications of this model have not been systematically explored. If GLIDE had useful medical knowledge, it could be used for medical image analysis tasks, a domain in which AI systems are still highly engineered towards a single use-case.
Here we show that the publicly available GLIDE model has reasonably strong representations of key topics in cancer research and oncology, in particular the general style of histopathology images and multiple facets of diseases, pathological processes and laboratory assays. However, GLIDE seems to lack useful representations of the style and content of radiology data. [That is really very unsurprising…]
Our findings demonstrate that domain-agnostic generative AI models can learn relevant medical concepts without explicit training. Thus, GLIDE and similar models might be useful for medical image processing tasks in the future—particularly with additional domain-specific fine-tuning.
Conventional methods for image-text generation tasks mainly tackle the naturally bidirectional generation tasks separately, focusing on designing task-specific frameworks to improve the quality and fidelity of the generated samples. Recently, Vision-Language Pre-training models have greatly improved the performance of image-to-text generation tasks, but large-scale pre-training models for the text-to-image synthesis task are still under-developed.
In this paper, we propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation with a transformer model. Based on image quantization models, we formulate both image generation and text generation as autoregressive generative tasks conditioned on the text/image input [cf. CogView, GLIDE, RuDOLPH, L-Verse, Huang et al 2021]. The bidirectional image-text generative modeling eases semantic alignment across vision and language. For the text-to-image generation process, we further propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
To explore the landscape of large-scale pre-training for bidirectional text-image generation, we train a 10-billion-parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs, which achieves state-of-the-art performance for both text-to-image and image-to-text tasks, obtaining an FID of 7.9 on MS COCO for text-to-image synthesis and the best results on COCO-CN and AIC-ICC for image captioning.
…The sources of our dataset are listed as follows:
Chinese Webpages: We crawl 800 million raw Chinese alt-text descriptions paired with images from various Chinese webpages, conduct several steps of filtering, and ultimately harvest 70 million text-image pairs. The filtering rules (sketched in code after this list) mainly include: (1) Text-length: the number of words in the alt-text is less than 15. (2) Text-content: the alt-text must contain at least one noun and contain no special characters. (3) Image-text similarity: the similarity score between the alt-text and image (calculated by an in-house text-image matching model with a score range from 0.0 to 1.0) is greater than 0.5.
Image Search Engine: We collect roughly 60 million query texts and corresponding user-clicked images from our internal image search engine. There is often a strong correlation between the query and user-clicked images.
Public image-text Dataset: We collect a total of 15 million text-image pairs from two public datasets, CC3M and CC12M. The captions in these datasets are translated to Chinese through the Baidu Translate API.
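A minimal sketch of the three alt-text filtering rules above, with `matcher` standing in for the in-house text-image matching model and `nouns` for whatever noun-detection the pipeline used (both are assumptions; the paper does not publish this code):

```python
def keep_pair(alt_text, image, matcher, nouns):
    """Return True if an (alt_text, image) pair passes the three filters."""
    words = alt_text.split()
    if len(words) >= 15:                      # rule 1: fewer than 15 words
        return False
    if not any(w in nouns for w in words):    # rule 2a: at least one noun
        return False
    if not alt_text.isprintable():            # rule 2b: crude "no special characters" proxy
        return False
    return matcher(alt_text, image) > 0.5     # rule 3: similarity score > 0.5
```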
…Qualitative Results: ERNIE-ViLG has acquired the capability to generate various scenes, from basic objects to complex combinations of objects. Some examples are shown in Figure 4. As can be seen in the examples, ERNIE-ViLG can not only draw entities mentioned in the given text description, but also combine them with the background in a reasonable way. Surprisingly, we also find two special skills ERNIE-ViLG develops. First, ERNIE-ViLG can generate images of different styles by simply adding text prompts, without the fine-tuning that CogView requires (Figure 5). Second, our model can generate realistic images given Chinese ancient poetry, which shows promising understanding of brief and abstract descriptions: real concepts in the poetry are well-organized and the artistic conception is well-rendered (Figure 6).
…We compare the results of our end-to-end training method with the two-stage pipeline baseline, as shown in Table 7. For the two-stage pipeline, we train a text-to-image generator and use the decoder of the dVAE directly as the reconstructor. "Two-stage G (R)" refers to the separately trained generator (reconstructor), and "end-to-end G (R)" refers to the end-to-end trained generator (reconstructor). Our end-to-end method achieves a notable FID improvement of 1.5 over the two-stage pipeline. We find that combining the end-to-end trained generator (end-to-end G) with the dVAE decoder (two-stage R) also brings an FID improvement of 0.9 over the two-stage pipeline, but falls behind the fully end-to-end method. This indicates our proposed end-to-end method improves both the generator (two-stage G & two-stage R vs. end-to-end G & two-stage R) and the reconstructor (end-to-end G & two-stage R vs. end-to-end G & end-to-end R).
We also input the visual sequences of real images discretized by the dVAE (gold image sequences) into the two reconstructors for comparison. Experimental results (the last two lines in Table 7) show that the end-to-end trained reconstructor has a more obvious advantage in reconstruction from real-image discrete representations. We expect end-to-end training to be even more effective on the 10-billion-parameter ERNIE-ViLG, since the discrete image representations generated by a more capable generator are much closer to the real distribution, and the hidden embeddings of a larger model provide more useful features for the reconstructor. Due to the instability of training both a GAN and a large-scale generative model, we have not used end-to-end training for our 10-billion-parameter model based on VQGAN. We will address the instability issue in future work and improve the 10-billion-parameter ERNIE-ViLG through end-to-end training.
Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity.
We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance.
We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators over those from DALL·E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
We train a smaller model on a filtered dataset and release the code and weights at Github.
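Classifier-free guidance, the strategy human evaluators preferred, amounts to one extrapolation per denoising step. A minimal sketch, assuming a hypothetical `model(x, t, cond=...)` noise predictor trained with random caption dropout:

```python
def cfg_eps(model, x, t, text_cond, guidance_scale=3.0):
    """Classifier-free guidance: push the prediction away from the
    unconditional estimate and toward the text-conditional one."""
    eps_uncond = model(x, t, cond=None)       # caption dropped
    eps_cond = model(x, t, cond=text_cond)    # caption supplied
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Larger `guidance_scale` values trade diversity for fidelity and caption adherence, which is exactly the trade-off the abstract refers to.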
This technical report presents "Emojich", a text-to-image neural network that generates emojis conditioned on captions in Russian. We aim to preserve the generalization ability of the pretrained big model ruDALL·E Malevich (XL), with 1.3B parameters, at the fine-tuning stage, while giving a special style to the generated images. We present some engineering methods, the code implementation, all hyper-parameters for reproducing the results, and a Telegram bot where everyone can create their own customized sets of stickers. We also demonstrate some newly generated emojis obtained by the "Emojich" model.
One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time- and cost-consuming.
In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text-conditioning is seamlessly alleviated via generating text features from image features.
Extensive experiments are conducted to illustrate the effectiveness of the proposed method. We obtain state-of-the-art results in the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs.
Furthermore, our method can be applied to fine-tuning pre-trained models, which saves both training time and cost. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, yet with only around 1% of the model size and training data relative to the recently proposed large DALL·E model.
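The language-free trick relies on CLIP's shared embedding space: at training time, a perturbed image embedding stands in for the missing text embedding. A minimal sketch of one such perturbation scheme (the noise level and exact form are assumptions for illustration, not necessarily the paper's):

```python
import torch

def pseudo_text_feature(img_emb, noise_level=0.1):
    """Fake a 'text' condition from a CLIP image embedding by adding
    norm-scaled Gaussian noise and re-normalizing onto the CLIP sphere."""
    eps = torch.randn_like(img_emb)
    eps = eps / eps.norm(dim=-1, keepdim=True)              # random unit direction
    fake = img_emb + noise_level * img_emb.norm(dim=-1, keepdim=True) * eps
    return fake / fake.norm(dim=-1, keepdim=True)
```

At inference time the real CLIP text embedding is used instead, so no captions are ever needed during training.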
This paper presents a unified multimodal pre-trained model called NÜWA that can generate new or manipulate existing visual data (ie. images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to consider the nature of the visual data and reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. Project repo is Github.
Far beyond learning long-range interactions of natural language, transformers are becoming the de-facto standard for many vision tasks with their power and scalability. Especially with cross-modal tasks between image and text, vector quantized variational autoencoders (VQ-VAEs) are widely used to make a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of feature-augmented variational autoencoder (AugVAE) and bidirectional auto-regressive transformer (BiART) for text-to-image and image-to-text generation.
Our AugVAE shows state-of-the-art reconstruction performance on the ImageNet1K validation set, along with robustness to unseen images in the wild. Unlike other models, BiART can distinguish between image (or text) as a conditional reference and as a generation target. L-Verse can be used directly for image-to-text or text-to-image generation tasks without any finetuning or extra object-detection frameworks.
In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We furthermore assess the scalability of the L-Verse architecture on Conceptual Captions and present initial results of bidirectional vision-language representation learning in the general domain.
Multi-modal language-vision models trained on hundreds of millions of image-text pairs (eg. CLIP, DALL·E) have recently surged, showing remarkable capability to perform zero-shot or few-shot learning and transfer even in the absence of per-sample labels on target image data. Despite this trend, to date there have been no publicly available datasets of sufficient scale for training such models from scratch.
To address this issue, in a community effort we build and release for public use LAION-400M, a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings [clusters] and kNN indices that allow efficient similarity search.
Can visual artworks created using generative visual algorithms inspire human creativity in storytelling? We asked writers to write creative stories from a starting prompt, and provided them with visuals created by generative AI models from the same prompt. Compared to a control group, writers who used the visuals as a story-writing aid wrote statistically-significantly more creative, original, complete and visualizable stories, and found the task more fun. Of the generative algorithms used (BigGAN, VQGAN, DALL·E, CLIPDraw), VQGAN was the most preferred. The control group that did not view the visuals did significantly better in integrating the starting prompts. Findings indicate that cross-modality inputs by AI can benefit divergent aspects of creativity in human-AI co-creation, but may hinder convergent thinking.
We study the joint learning of image-to-text and text-to-image generations, which are naturally bi-directional tasks. Typical existing works design two separate task-specific models for each task, which impose expensive design efforts. In this work, we propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks. We adopt Transformer as our unified architecture for its strong performance and task-agnostic design. Specifically, we formulate both tasks as sequence generation tasks, where we represent images and text as unified sequences of tokens, and the Transformer learns multimodal interactions to generate sequences. We further propose two-level granularity feature representations and sequence-level training to improve the Transformer-based unified framework. Experiments show that our approach significantly improves previous Transformer-based model X-LXMERT’s FID from 37.0 to 29.9 (lower is better) for text-to-image generation, and improves CIDEr-D score from 100.9% to 122.6% for fine-tuned image-to-text generation on the MS-COCO dataset. Our code is available online.
Although DALL·E has shown an impressive ability of composition-based systematic generalization in image generation, it requires a dataset of text-image pairs, and the compositionality is provided by the text. In contrast, object-centric representation models like the Slot Attention model learn composable representations without a text prompt. However, unlike DALL·E, its ability to systematically generalize for zero-shot generation is significantly limited. In this paper, we propose a simple but novel slot-based autoencoding architecture, called SLATE, that combines the best of both worlds: learning object-centric representations that allow systematic generalization in zero-shot image generation without text. As such, this model can also be seen as an illiterate DALL·E model. Unlike the pixel-mixture decoders of existing object-centric representation models, we propose to use the Image GPT decoder conditioned on the slots to capture complex interactions among the slots and pixels. In experiments, we show that this simple and easy-to-implement architecture, not requiring a text prompt, achieves significant improvement in in-distribution and out-of-distribution (zero-shot) image generation, and qualitatively comparable or better slot-attention structure than models based on mixture decoders.
Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional and class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at 256×256 resolution, we achieve an Inception Score (IS) of 175.1 and a Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly beats iGPT-L, improving linear-probe accuracy from 60.3% to 72.2% for a similar model size. VIM-L also outperforms iGPT-XL, which is trained with extra web image data and a larger model size.
Current recommendation approaches help online merchants predict, for each visiting user, which subset of their existing products is the most relevant. However, besides being interested in matching users with existing products, merchants are also interested in understanding their users’ underlying preferences. This could indeed help them produce or acquire better matching products in the future.
We argue that existing recommendation models cannot directly be used to predict the optimal combination of features that will make new products better serve the needs of the target audience. To tackle this, we turn to generative models, which allow us to explicitly learn distributions over product feature combinations in both text and visual space. We develop WARHOL, a product generation and recommendation architecture that takes as input past user shopping activity and generates relevant textual and visual descriptions of novel products. We show that WARHOL can approach the performance of state-of-the-art recommendation models, while being able to generate entirely new products that are relevant to the given user profiles.
AI is undergoing a paradigm shift with the rise of models (eg. BERT, DALL·E 1, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character.
This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (eg. language, vision, robotics, reasoning, human interaction) and technical principles (eg. model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (eg. law, healthcare, education) and societal impact (eg. inequity, misuse, economic and environmental impact, legal and ethical considerations).
Though foundation models are based on conventional deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties.
To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.
The history of AI is one of increasing emergence and homogenization. With the introduction of machine learning, we moved from a large proliferation of specialized algorithms that specified how to compute answers to a small number of general algorithms that learned how to compute answers (ie. the algorithm for computing answers emerged from the learning algorithm). With the introduction of deep learning, we moved from a large proliferation of hand-engineered features for learning algorithms to a small number of architectures that could be pointed at a new domain and discover good features for that domain. Recently, the trend has continued: we have moved from a large proliferation of trained models for different tasks to a few large “foundation models” which learn general algorithms useful for solving specific tasks. BERT and GPT-3 are central examples of foundation models in language; many NLP tasks that previously required different models are now solved using finetuned or prompted versions of BERT and/or GPT-3.
Note that, while language is the main example of a domain with foundation models today, we should expect foundation models to be developed in an increasing number of domains over time. The authors call these “foundation” models to emphasize that (1) they form a fundamental building block for applications and (2) they are not themselves ready for deployment; they are simply a foundation on which applications can be built. Foundation models have been enabled only recently because they depend on having large scale in order to make use of large unlabeled datasets using self-supervised learning to enable effective transfer to new tasks. It is particularly challenging to understand and predict the capabilities exhibited by foundation models because their multitask nature emerges from the large-scale training rather than being designed in from the start, making the capabilities hard to anticipate. This is particularly unsettling because foundation models also lead to substantially increased homogenization, where everyone is using the same few models, and so any new emergent capability (or risk) is quickly distributed to everyone.
The authors argue that academia is uniquely suited to study and understand the risks of foundation models. Foundation models are going to interact with society, both in terms of the data used to create them and the effects on people who use applications built upon them. Thus, analysis of them will need to be interdisciplinary; this is best achieved in academia due to the concentration of people working in the various relevant areas. In addition, market-driven incentives need not align well with societal benefit, whereas the research mission of universities is the production and dissemination of knowledge and creation of global public goods, allowing academia to study directions that would have large societal benefit that might not be prioritized by industry.
All of this is just a summary of parts of the introduction to the report. The full report is over 150 pages and goes into detail on capabilities, applications, technologies (including technical risks), and societal implications. I’m not going to summarize it here, because it is long and a lot of it isn’t that relevant to alignment; I’ll instead note down particular points that I found interesting.
(pg. 26) Some studies have suggested that foundation models in language don't learn linguistic constructions robustly; even if a model uses a construction well once, it may not do so again, especially under distribution shift. In contrast, humans can easily "slot in" new knowledge into existing linguistic constructions.
(pg. 34) This isn’t surprising but is worth repeating: many of the capabilities highlighted in the robotics section are very similar to the ones that we focus on in alignment (task specification, robustness, safety, sample efficiency).
(pg. 42) For tasks involving reasoning (eg. mathematical proofs, program synthesis, drug discovery, computer-aided design), neural nets can be used to guide a search through a large space of possibilities. Foundation models could be helpful because (1) since they are very good at generating sequences, you can encode arbitrary actions (eg. in theorem proving, they can use arbitrary instructions in the proof assistant language rather than being restricted to an existing database of theorems), (2) the heuristics for effective search learned in one domain could transfer well to other domains where data is scarce, and (3) they could accept multimodal input: for example, in theorem proving for geometry, a multimodal foundation model could also incorporate information from geometric diagrams.
(Section 3) A substantial portion of the report is spent discussing potential applications of foundation models. This is the most in-depth version of this I have seen; anyone aiming to forecast the impacts of AI on the real world in the next 5–10 years should likely read this section. It’s notable to me how nearly all of the applications have an emphasis on robustness and reliability, particularly in truth-telling and logical reasoning.
(Section 4.3) We’ve seen a few (AN #152) ways (AN #155) in which foundation models can be adapted. This section provides a good overview of the various methods that have been proposed in the literature. Note that adaptation is useful not just for specializing to a particular task like summarization, but also for enforcing constraints, handling distributional shifts, and more.
(pg. 92) Foundation models are commonly evaluated by their performance on downstream tasks. One limitation of this evaluation paradigm is that it makes it hard to distinguish between the benefits provided by better training, data, adaptation techniques, architectures, etc. (The authors propose a bunch of other evaluation methodologies we could use.)
(Section 4.9) There is a review of AI safety and AI alignment as it relates to foundation models, if you’re interested. (I suspect there won’t be much new for readers of this newsletter.)
(Section 4.10) The section on theory emphasizes studying the pretraining-adaptation interface, which seems quite good to me. I especially liked the emphasis on the fact that pretraining and adaptation work on different distributions, and so it will be important to make good modeling assumptions about how these distributions are related.
Automatic evaluations for natural language generation (NLG) conventionally rely on token-level or embedding-level comparisons with the text references. This is different from human language processing, for which visual imaginations often improve comprehension. In this work, we propose ImaginE, an imagination-based automatic evaluation metric for natural language generation. With the help of CLIP and DALL·E, two cross-modal models pre-trained on large-scale image-text pairs, we automatically generate an image as the embodied imagination for the text snippet and compute the imagination similarity using contextual embeddings. Experiments spanning several text generation tasks demonstrate that adding imagination with our ImaginE displays great potential in introducing multi-modal information into NLG evaluation, and improves existing automatic metrics’ correlations with human similarity judgments in many circumstances.
[interview; online DT] We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling.
Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite the simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
…Decision Transformer: autoregressive sequence modeling for RL: We take a simple approach: each modality (return, state, or action) is passed into an embedding network (a convolutional encoder for images, a linear layer for continuous states). The embeddings are then processed by an autoregressive transformer model, trained to predict the next action given the previous tokens using a linear output layer. Evaluation is also easy: we can initialize with a desired target return (eg. 1 or 0 for success or failure) and the starting state in the environment. Unrolling the sequence—similar to standard autoregressive generation in language models—yields a sequence of actions to execute in the environment.
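A minimal PyTorch sketch of this token layout (the dimensions, layer counts, and the omission of timestep embeddings are all simplifications for illustration):

```python
import torch
import torch.nn as nn

class DecisionTransformerSketch(nn.Module):
    """Embed (return-to-go, state, action) triples, run a causal
    transformer, and read next-action predictions off the state tokens."""
    def __init__(self, state_dim, act_dim, d_model=128):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1); states: (B, T, state_dim); actions: (B, T, act_dim)
        toks = torch.stack([self.embed_rtg(rtg),
                            self.embed_state(states),
                            self.embed_action(actions)], dim=2)   # (B, T, 3, D)
        B, T = toks.shape[0], toks.shape[1]
        toks = toks.reshape(B, 3 * T, -1)  # interleave as (R_1, s_1, a_1, R_2, ...)
        causal = torch.triu(torch.full((3 * T, 3 * T), float('-inf')), diagonal=1)
        h = self.transformer(toks, mask=causal)
        return self.predict_action(h[:, 1::3])  # predict a_t from each state token
```

Conditioning on a high return at evaluation time is then just a matter of what you write into the first `rtg` slot.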
…Sequence modeling as multitask learning: One effect of this type of modeling is that we perform conditional generation, where we initialize a trajectory by inputting our desired return. Decision Transformer does not yield a single policy; rather, it models a wide distribution of policies. If we plot average achieved return against the target return of a trained Decision Transformer, we find distinct policies are learned that can reasonably match the target, trained only with supervised learning. Furthermore, on some tasks (such as Q*bert and Seaquest), we find Decision Transformer can actually extrapolate outside of the dataset and model policies achieving higher return!
The simplicity of this version of the control codes or ‘inline metadata trick’ (eg. CTRL) means it can be reused with almost any generative model where some measure of quality or reward is available (even if only self-critique like likelihood of a sequence, eg. in Meena-style best-of ranking or inverse prompting): you have an architecture-floorplan DALL·E 1? Use standard architecture software to score plans by their estimated thermal efficiency/sunlight/etc; prefix these scores, retrain, & decode for good floorplans maximizing thermal efficiency/sunlight. You have a regular DALL·E 1? Sample n samples per prompt, CLIP-rank the images, prefix their ranking, retrain… No useful CLIP? Then use the CogView self-text-captioning trick to turn generated images back into text, rank by text likelihood… Choose Your Own Adventure AI Dungeon game-tree? Rank completions by player choice, feed back in for preference learning… All of the work is done by the data, as long as the generative model is smart enough.
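A minimal sketch of the trick itself: score, bucket, prefix, retrain. `score_fn` is whatever quality signal is available (CLIP rank, sequence likelihood, player choice, thermal efficiency); everything here is illustrative:

```python
def add_quality_prefixes(samples, score_fn, n_bins=10):
    """Prepend a coarse quality-bucket control token to each sample;
    after retraining on the prefixed data, decode from '<score_9> ...'
    to sample from the high-quality end of the distribution."""
    scored = [(score_fn(s), s) for s in samples]
    lo = min(sc for sc, _ in scored)
    width = (max(sc for sc, _ in scored) - lo) / n_bins or 1.0  # guard all-equal case
    return [f"<score_{min(int((sc - lo) / width), n_bins - 1)}> {s}"
            for sc, s in scored]
```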
[WP] The Beijing Academy of Artificial Intelligence, styled as BAAI and known in Chinese as 北京智源人工智能研究院, launched the latest version of Wu Dao 悟道, a pre-trained deep learning model that the lab dubbed “China’s first” and “the world’s largest ever”, with a whopping 1.75 trillion parameters.
…Unlike conventional deep learning models that are usually task-specific, Wu Dao is a multi-modal model trained to tackle both text and image, 2 dramatically different sets of problems. At BAAI’s annual academic conference on Tuesday, the institution demonstrated Wu Dao performing tasks such as natural language processing, text generation, image recognition, image generation, etc.
The model is capable of writing poems and couplets in traditional Chinese styles, answering questions, writing essays, generating alt text for images, and generating corresponding images from natural-language descriptions with a decent level of photorealism. It is even able to power “virtual idols”, with the help of Xiaoice, a Chinese company spun off from Microsoft—so there can be voice support too, in addition to text and image.
…Very interestingly, this model with 1.75 trillion parameters is already the 2.0 version of Wu Dao, whose first version was launched just under 3 months ago. One of the main reasons the Chinese researchers made progress so quickly was that they were able to tap into China’s supercomputing clusters, with the help of a few core members who also worked on the national supercomputing projects.
A little more technical explanation: BAAI researchers developed and open-sourced a deep learning system called FastMoE, which allowed Wu Dao to be trained on both supercomputers and regular GPUs with substantially more parameters, giving the model, in theory, more flexibility than Google’s take on the MoE, or Mixture-of-Experts. This is because Google’s system requires the company’s dedicated TPU hardware and distributed training framework, while BAAI’s FastMoE works with at least one industry-standard open-source framework, namely PyTorch, and can be operated on off-the-shelf hardware.
The Chinese lab claims that Wu Dao’s sub-models achieved better performance than previous models, beating OpenAI’s CLIP and Google’s ALIGN on English image and text indexing in the Microsoft COCO dataset. For image generation from text, a novel task, BAAI claims that Wu Dao’s sub-model CogView beat OpenAI’s DALL·E 1, a state-of-the-art neural network launched in January this year with 12 billion parameters.
“The way to artificial general intelligence is big models and big computer”, said Dr. Zhang Hongjiang, chairman of BAAI, “What we are building is a power plant for the future of AI, with mega data, mega computing power, and mega models, we can transform data to fuel the AI applications of the future.”
…However, while OpenAI and DeepMind are privately funded, a key distinction for BAAI is that it’s formed and funded with substantial help from China’s Ministry of Science and Technology, as well as Beijing’s municipal government.
Conditional image synthesis aims to create an image according to some multi-modal guidance in the forms of textual descriptions, reference images, and image blocks to preserve, as well as their combinations. In this paper, instead of investigating these control signals separately, we propose a new two-stage architecture, M6-UFC, to unify any number of multi-modal controls. In M6-UFC, both the diverse control signals and the synthesized image are uniformly represented as a sequence of discrete tokens to be processed by Transformer. Different from existing two-stage autoregressive approaches such as DALL·E and VQGAN, M6-UFC adopts non-autoregressive generation (NAR) at the second stage to enhance the holistic consistency of the synthesized image, to support preserving specified image blocks, and to improve the synthesis speed. Further, we design a progressive algorithm that iteratively improves the non-autoregressively generated image, with the help of two estimators developed for evaluating the compliance with the controls and evaluating the fidelity of the synthesized image, respectively. Extensive experiments on a newly collected large-scale clothing dataset M2C-Fashion and a facial dataset Multi-Modal CelebA-HQ verify that M6-UFC can synthesize high-fidelity images that comply with flexible multi-modal controls.
[Affiliations: Tsinghua University, DAMO Academy, Alibaba Group, BAAI; CogView2] Text-to-image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding.
We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, eg. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, eg. eliminating NaN losses.
…3.3 Image Captioning and Self-reranking: Finetuning CogView for image captioning is straightforward: exchange the order of text and image tokens in the input sequences [and then finetune the model on the new flipped training corpus, ie. ‘analysis by synthesis’]. Since the model has already learnt the corresponding relationships between text and images, reversing the generation is not hard. We did not evaluate the performance because (1) there is no authoritative Chinese image captioning benchmark, and (2) image captioning is not the focus of this work. The main purpose of finetuning such a model is self-reranking. We propose the Caption Score (CapS) to evaluate the correspondence between images and text…this method can be seen as an adaptation of inverse prompting for text-to-image generation. Finally, images with the highest CapS are chosen. [cf. self-distillation, CLIP ranking]
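In sketch form, the self-reranking loop looks like the following, with `caption_model.logprob(text, image_tokens)` an assumed interface for the flipped (image-to-text) finetuned model:

```python
def caps_rerank(candidate_images, prompt, caption_model, k=8):
    """Keep the k generated images whose tokens best 'explain' the original
    prompt under the captioning-finetuned model (Caption Score, CapS)."""
    ranked = sorted(candidate_images,
                    key=lambda img: caption_model.logprob(prompt, img),
                    reverse=True)
    return ranked[:k]
```

This is the same generate-many-then-filter pattern as CLIP reranking, but uses the generator's own reversed likelihood instead of a separate contrastive model.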
Generating videos from text is a challenging task due to its high computational requirements for training and infinite possible answers for evaluation. Existing works typically experiment on simple or small datasets, where the generalization ability is quite limited.
In this work, we propose GODIVA, an open-domain text-to-video pretrained model that can generate videos from text in an auto-regressive manner using a three-dimensional sparse attention mechanism. We pretrain our model on HowTo100M, a large-scale text-video dataset that contains more than 136 million text-video pairs. Experiments show that GODIVA not only can be fine-tuned on downstream video generation tasks, but also has good zero-shot capability on unseen texts.
We also propose a new metric called Relative Matching (RM) to automatically evaluate the video generation quality. Several challenges are listed and discussed as future work.
We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos.
VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings.
Despite the simplicity of the formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate high-fidelity natural videos from UCF-101 and the Tumblr GIF Dataset (TGIF).
[Fun note: the corpus uses The Pile.] In a bid to promote the research and development of China’s own large-scale pretraining models and further explore universal intelligence from a more fundamental perspective, the Beijing Academy of Artificial Intelligence (BAAI) recently unveiled Wu Dao 1.0, China’s first homegrown super-scale intelligent model system. The work was led by BAAI Research Academic Vice President and Tsinghua University Professor Tang Jie, with contributions from a team of more than 100 AI scientists from Peking University, Tsinghua University, Renmin University of China, Chinese Academy of Sciences and other institutes.
Wu Dao 1.0 has initiated large-scale research projects via 4 related models: Wu Dao—Wen Yuan, Wu Dao—Wen Lan, Wu Dao—Wen Hui, and Wu Dao—Wen Su.
Wu Dao—Wen Yuan: is China’s largest-ever pretraining language model, boasting the best processing power in mainstream languages, including Chinese and English. It has surpassed average human performance benchmarks on text categorization, sentiment analysis, natural language inference, reading comprehension and more. The Wu Dao—Wen Yuan project is designed to explore universal natural language understanding (NLU) techniques and study brain-inspired language models. It has 2.6 billion parameters and is capable of performing cognitive activities such as memorization, comprehension, retrieval, numerical calculation, multi-language, etc. Wu Dao—Wen Yuan has achieved GPT-3 comparable performance on 20 Chinese NLP tasks such as open-domain answering, grammar correction, sentiment analysis, etc.
…Wen Yuan introduces the open-source Chinese pretraining model (CPM). Based on CPM, the CPM-Distill model reduces language confusion by 38% and achieves better results on downstream tasks.
Wu Dao—Wen Lan: meanwhile, is the first publicly available Chinese universal graphic multimodal pretraining model. The ultra-large-scale multimodal pretraining model aims to break through the theoretical challenges of pretraining multimodal data based on a combination of graphics, text and video, and eventually generate industrial-grade Chinese graphics pretraining models and applications that exceed SOTA performance. Currently, the model has 1 billion parameters and is trained on 50 million graphic pairs collected from open sources. The Wu Dao—Wen Lan model has reached SOTA performance, scoring 5% higher than the champion team on the Image Caption task on the Chinese public multimodal test set AIC-ICC and 20% higher than the most popular UNITER model on the Visual Entailment task.
…Wen Lan is the first Chinese generic multimodal pretraining model that can understand “connotative information” based on weak correlations of images and text. Wen Lan uses an advanced cross-modal contrastive learning algorithm: given an image-text pair, it can enlarge the number of negative samples for each modality, especially those which are difficult to distinguish, further improving the expressive ability of the neural networks. It can easily replace image and text encoders with the most advanced single-modality pretraining models, achieving 20× faster performance than the UNITER model.
Wu Dao—Wen Hui: is an ultra-large-scale cognitive-oriented pretraining model that focuses on a series of essential problems in general artificial intelligence from a cognitive perspective, aiming to develop and enhance the logic/consciousness/reasoning-based cognitive capabilities of pretraining models. Wu Dao—Wen Hui has reached 11.3 billion parameters, and through simple fine-tuning can generate poetry, make videos, draw pictures, retrieve text, perform complex reasoning, etc. BAAI says the model achieves near-human performance on poetry generation on the Turing test.
…Wen Hui proposes a new pretraining paradigm, the Generative Language Model, breaking the bottlenecks of BERT and GPT. For the first time, a single model has achieved the best results in both language understanding and generation tasks, surpassing common pretraining models such as BERT, RoBERTa and T5 trained on the same volume of data. Wen Hui’s continuous-vector-based fine-tuning method, P-tuning, makes it the first autoregressive model to surpass autoencoder models on NLU tasks, and has achieved SOTA results on more than 10 tasks such as Knowledge Extraction and SuperGLUE Few-shot Learning, with over 20% performance improvement. Wen Hui’s inverse prompting algorithm achieves close-to-human performance on Q&A and poetry generation, and is the first model that can generate classical Chinese poetry based on modern themes.
Wu Dao—Wen Su: is a large-scale training model for biomolecular structure prediction. It can handle super long biomolecular structures, where it has achieved SOTA performance, interpretability and robustness. Based on Google’s BERT language model, Wu Dao—Wen Su has completed protein training on the 100 GB UNIPARC database and gene training on 5–100,000 human peripheral blood immune cells (25–30 cell types) and 10,000 drug-resistant bacteria.
…Wen Su’s open-sourced FastMoE is the first high-performance MoE (Mixture-of-Experts Model) system that supports the PyTorch framework and a variety of hardware. Only one line of code is required to complete the MoE transformation, and model training speed is increased by 47× compared with the traditional PyTorch implementation.
“Paint by Word”, David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, Antonio Torralba (2021-03-19; CLIP; similar):
We investigate the problem of zero-shot semantic image painting. Instead of painting modifications into an image using only concrete colors or a finite set of semantic concepts, we ask how to create semantic paint based on open full-text descriptions: our goal is to be able to point to a location in a synthesized image and apply an arbitrary new concept such as “rustic” or “opulent” or “happy dog.” To do this, our method combines a state-of-the-art generative model of realistic images with a state-of-the-art text-image semantic similarity network. We find that, to make large changes, it is important to use non-gradient methods to explore latent space, and it is important to relax the computations of the GAN to target changes to a specific region. We conduct user studies to compare our methods to several baselines.
The high dimensionality of images presents architecture and sampling-efficiency challenges for likelihood-based generative models. Previous approaches such as VQ-VAE use deep autoencoders to obtain compact representations, which are more practical as inputs for likelihood-based models. We present an alternative approach, inspired by common image compression methods like JPEG, and convert images to quantized discrete cosine transform (DCT) blocks, which are represented sparsely as a sequence of DCT channel, spatial location, and DCT coefficient triples. We propose a Transformer-based autoregressive architecture, which is trained to sequentially predict the conditional distribution of the next element in such sequences, and which scales effectively to high-resolution images. On a range of image datasets, we demonstrate that our approach can generate high-quality, diverse images, with sample metric scores competitive with state-of-the-art methods. We additionally show that simple modifications to our method yield effective image colorization and super-resolution models.
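A minimal sketch of the JPEG-inspired representation for a grayscale image, using SciPy's DCT; the uniform quantization step `q` is an assumption (JPEG, and likely the paper, use per-frequency quantization tables), and dimensions are assumed divisible by the block size:

```python
import numpy as np
from scipy.fft import dctn

def image_to_dct_triples(img, block=8, q=16):
    """Convert an HxW grayscale array into sparse
    (DCT-channel, block-position, quantized-coefficient) triples."""
    H, W = img.shape
    triples = []
    for by in range(0, H, block):
        for bx in range(0, W, block):
            coeffs = dctn(img[by:by+block, bx:bx+block], norm='ortho')
            quant = np.round(coeffs / q).astype(int)
            pos = (by // block) * (W // block) + (bx // block)
            for (u, v), c in np.ndenumerate(quant):
                if c != 0:                 # sparsity: most coefficients quantize to 0
                    triples.append((u * block + v, pos, int(c)))
    return triples
```

The transformer then models these triples as a sequence, which is far shorter than the raw pixel sequence for high-resolution images.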
In this work, we construct the largest dataset for multimodal pretraining in Chinese, which consists of over 1.9 TB of images and 292 GB of texts covering a wide range of domains.
We propose a cross-modal pretraining method called M6, referring to Multi-Modality to Multi-Modality Multitask Mega-transformer, for unified pretraining on the data of single modality and multiple modalities. We scale the model size up to 10 billion and 100 billion parameters, and build the largest pretrained model in Chinese.
We apply the model to a series of downstream applications, and demonstrate its outstanding performance in comparison with strong baselines. Furthermore, we specifically design a downstream task of text-guided image generation, and show that the finetuned M6 can create high-quality images with high resolution and abundant details.
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
[Paper: “DALL·E 1: Zero-Shot Text-to-Image Generation”, Ramesh et al 2021. Re-implementation: DALL·E 1 Mini (writeup). cf. CogView, Wu Dao. Availability through OA API still planned as of 2021-09-05; see DALL·E 2.] DALL·E 1 is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs. We’ve found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.
GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. iGPT showed that the same type of neural network can also be used to generate images with high fidelity. [iGPT is another answer to the question of “how do we do images autoregressively, but not at the exorbitant cost of generating pixels 1 by 1?”; iGPT uses ‘super pixels’ & very small images, while DALL·E 1 uses VAE ‘tokens’ corresponding roughly to small squares, so the token sequence is relatively small, and the VAE does the actual decoding to raw pixels.] We extend these findings to show that manipulating visual concepts through language is now within reach.
DALL·E 1’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16,384, and the image is represented using 1,024 tokens with a vocabulary size of 8,192. The images are preprocessed to 256×256 resolution during training. Similar to VQ-VAE, each image is compressed to a 32×32 grid of discrete latent codes using a discrete VAE that we pretrained using a continuous relaxation. We found that training using the relaxation obviates the need for an explicit codebook, EMA loss, or tricks like dead code revival, and can scale up to large vocabulary sizes.
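The arithmetic implied by those numbers, as a quick sanity check (every figure below comes from the text above):

```python
text_tokens  = 256                  # BPE tokens, vocab 16,384
image_tokens = 32 * 32              # 1,024 latent codes, vocab 8,192
assert image_tokens == 1024

seq_len = text_tokens + image_tokens    # 1,280-token stream for the Transformer
pixels  = 256 * 256                     # raw image positions at training resolution
print(seq_len, pixels // image_tokens)  # 1280; each code covers 64 pixels (an 8x8 patch)
```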
…Capabilities: We find that DALL·E 1 is able to create plausible images for a great variety of sentences that explore the compositional structure of language. We illustrate this using a series of interactive visuals in the next section. The samples shown for each caption in the visuals are obtained by taking the top 32 of 512 after reranking with CLIP [cf. CLIPScore, CogView’s caption scores], but we do not use any manual cherry-picking, aside from the thumbnails and standalone images that appear outside of the visuals.
Controlling attributes: We test DALL·E 1’s ability to modify several of an object’s attributes, as well as the number of times that it appears.
Other demonstrated capabilities include drawing multiple objects, visualizing perspective and three-dimensionality, visualizing internal and external structure, and inferring contextual details.
…With varying degrees of reliability, DALL·E 1 provides access to a subset of the capabilities of a 3D rendering engine via natural language. It can independently control the attributes of a small number of objects, and to a limited extent, how many there are, and how they are arranged with respect to one another. It can also control the location and angle from which a scene is rendered, and can generate known objects in compliance with precise specifications of angle and lighting conditions.
Zero-shot visual reasoning: GPT-3 can be instructed to perform many kinds of tasks solely from a description and a cue to generate the answer supplied in its prompt, without any additional training. For example, when prompted with the phrase “here is the sentence ‘a person walking his dog in the park’ translated into French:”, GPT-3 answers “un homme qui promène son chien dans le parc.” This capability is called zero-shot reasoning. We find that DALL·E 1 extends this capability to the visual domain, and is able to perform several kinds of image-to-image translation tasks when prompted in the right way. [See also CLIP.]
We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it. Motivated by these results, we measure DALL·E 1’s aptitude for analogical reasoning problems by testing it on Raven’s Progressive Matrices, a visual IQ test that saw widespread use in the 20th century. Rather than treating the IQ test as a multiple-choice problem as originally intended, we ask DALL·E 1 to complete the bottom-right corner of each image using argmax sampling, and consider its completion to be correct if it is a close visual match to the original. DALL·E 1 is often able to solve matrices that involve continuing simple patterns or basic geometric reasoning, such as those in sets B and C. It is sometimes able to solve matrices that involve recognizing permutations and applying boolean operations, such as those in set D. The instances in set E tend to be the most difficult, and DALL·E 1 gets almost none of them correct. For each of the sets, we measure DALL·E 1’s performance on both the original images, and the images with the colors inverted. The inversion of colors should pose no additional difficulty for a human, yet does generally impair DALL·E 1’s performance, suggesting its capabilities may be brittle in unexpected ways.
Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images.
We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (1) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (2) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet.
Code and pretrained models can be found on GitHub.
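Schematically, the two stages look like this (a toy sketch: the random `codes` tensor stands in for a real VQGAN encoder’s output, and the single Transformer layer stands in for the full autoregressive model):

```python
import torch
import torch.nn as nn

vocab, grid = 1024, 16                          # codebook size; 16x16 latent grid
# Stage 1 (VQGAN encoder, hypothetical): image -> grid of codebook indices.
codes = torch.randint(vocab, (grid * grid,))    # stand-in for vqgan.encode(image)

# Stage 2: a Transformer models the code sequence autoregressively.
emb = nn.Embedding(vocab, 256)
layer = nn.TransformerEncoderLayer(256, nhead=8, batch_first=True)
head = nn.Linear(256, vocab)

x = emb(codes[:-1])[None]                       # teacher forcing: predict next code
L = x.shape[1]
causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
logits = head(layer(x, src_mask=causal))
loss = nn.functional.cross_entropy(logits.view(-1, vocab), codes[1:])
# At sampling time, codes are drawn one by one, then decoded back to pixels
# by the stage-1 decoder (vqgan.decode, not shown).
```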
Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces that provide a sparse, fine-grained visual grounding for phrases. We propose TReCS, a sequential model that exploits this grounding to generate images. TReCS uses descriptions to retrieve segmentation masks and predict object labels aligned with mouse traces. These alignments are used to select and position masks to generate a fully covered segmentation canvas; the final image is produced by a segmentation-to-image generator using this canvas. This multi-step, retrieval-based approach outperforms existing direct text-to-image generation models on both automatic metrics and human evaluations: overall, its generated images are more photo-realistic and better match descriptions.
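The multi-step pipeline, reduced to pseudocode; every helper below is a hypothetical stand-in for a component described in the abstract, not a real TReCS API:

```python
def trecs(description, mouse_trace):
    labels = predict_labels(description, mouse_trace)  # object labels aligned to the trace
    masks  = retrieve_masks(description, labels)       # retrieval, not direct generation
    canvas = place_masks(masks, mouse_trace)           # fully covered segmentation canvas
    return segmentation_to_image(canvas)               # final photorealistic image
```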
Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state of the art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This begs the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family—LXMERT—finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements including: discretizing visual representations, using uniform masking with a large range of masking ratios, and aligning the right pre-training datasets to the right objectives, which enables it to paint. X-LXMERT’s image generation capabilities rival state of the art generative models while its question answering and captioning abilities remain comparable to LXMERT. Finally, we demonstrate the generality of these training refinements by adding image generation capabilities into UNITER to produce X-UNITER.
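Of those refinements, the masking change is easy to sketch: instead of a fixed masking rate, draw the ratio itself uniformly per example (the range endpoints below are illustrative assumptions):

```python
import torch

def uniform_mask(tokens, lo=0.1, hi=1.0, mask_id=0):
    """Corrupt a 1-D token tensor with a per-example random masking ratio."""
    ratio = torch.empty(1).uniform_(lo, hi).item()  # masking rate for this example
    mask = torch.rand(tokens.shape) < ratio
    corrupted = tokens.clone()
    corrupted[mask] = mask_id
    return corrupted, mask  # train to predict the original tokens at masked positions
```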
…By establishing a correlation between sample quality and image classification accuracy, we show that our best generative model also contains features competitive with top convolutional nets in the unsupervised setting.
Transformer models like BERT and GPT-2 are domain agnostic, meaning that they can be directly applied to 1-D sequences of any form. When we train GPT-2 on images unrolled into long sequences of pixels, which we call iGPT, we find that the model appears to understand 2-D image characteristics such as object appearance and category. This is evidenced by the diverse range of coherent image samples it generates, even without the guidance of human provided labels. As further proof, features from the model achieve state-of-the-art performance on a number of classification datasets and near state-of-the-art unsupervised accuracy on ImageNet…we deliberately use the same transformer architecture as GPT-2 in language. As a consequence, we require substantially more compute in order to produce features competitive with those from top unsupervised convolutional nets…Generative sequence modeling is a universal unsupervised learning algorithm: since all data types can be represented as sequences of bytes, a transformer can be directly applied to any data type without additional engineering. Our work tests the power of this generality by directly applying the architecture used to train GPT-2 on natural language to image generation. We deliberately chose to forgo hand coding any image specific knowledge in the form of convolutions or techniques like relative attention, sparse attention, and 2-D position embeddings.
…We train iGPT-S, iGPT-M, and iGPT-L, transformers containing 76M, 455M, and 1.4B parameters respectively, on ImageNet. We also train iGPT-XL, a 6.8 billion parameter transformer, on a mix of ImageNet and images from the web. Due to the large computational cost of modeling long sequences with dense attention, we train at the low resolutions of 32×32, 48×48, and 64×64…Our next result establishes the link between generative performance and feature quality. We find that both increasing the scale of our models and training for more iterations result in better generative performance, which directly translates into better feature quality.
…When we evaluate our features using linear probes on CIFAR-10, CIFAR-100, and STL-10, we outperform features from all supervised and unsupervised transfer algorithms. Our results are also compelling in the full fine-tuning setting.
…Because we use the generic sequence transformer used for GPT-2 in language, our method requires large amounts of compute: iGPT-L was trained for roughly 2500 V100-days while a similarly performing MoCo model can be trained in roughly 70 V100-days…We have shown that by trading off 2-D knowledge for scale and by choosing predictive features from the middle of the network, a sequence transformer can be competitive with top convolutional nets for unsupervised image classification. Notably, we achieved our results by directly applying the GPT-2 language model to image generation. Our results suggest that due to its simplicity and generality, a sequence transformer given sufficient compute might ultimately be an effective way to learn excellent features in many domains.
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images.
We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure.
Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving 72.0% top-1 accuracy on a linear probe of our features.
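For intuition, a sketch of iGPT’s preprocessing as described: shrink the image, map each RGB pixel to its nearest entry in a small color palette (iGPT used a 512-entry, i.e. 9-bit, k-means palette), and unroll the result into a 1-D token sequence. The random palette below is a stand-in for fitted k-means centroids:

```python
import numpy as np

palette = np.random.rand(512, 3)          # stand-in for k-means color centroids

def to_sequence(img32):                   # img32: (32, 32, 3) floats in [0, 1]
    flat = img32.reshape(-1, 3)           # 1,024 pixels in raster order
    d = ((flat[:, None, :] - palette[None]) ** 2).sum(-1)  # distances to palette
    return d.argmin(-1)                   # 1,024 tokens, vocabulary size 512
```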
There are two prevailing technical theories about what it will take to reach AGI. In one, all the necessary techniques already exist; it’s just a matter of figuring out how to scale and assemble them. In the other, there needs to be an entirely new paradigm; deep learning, the current dominant technique in AI, won’t be enough. Most researchers fall somewhere between these extremes, but OpenAI has consistently sat almost exclusively on the scale-and-assemble end of the spectrum. Most of its breakthroughs have been the product of sinking dramatically greater computational resources into technical innovations developed in other labs.
Brockman and Sutskever deny that this is their sole strategy, but the lab’s tightly guarded research suggests otherwise. A team called “Foresight” runs experiments to test how far they can push AI capabilities forward by training existing algorithms with increasingly large amounts of data and computing power. For the leadership, the results of these experiments have confirmed its instincts that the lab’s all-in, compute-driven strategy is the best approach. For roughly six months, these results were hidden from the public because OpenAI sees this knowledge as its primary competitive advantage. Employees and interns were explicitly instructed not to reveal them, and those who left signed nondisclosure agreements. It was only in January that the team, without the usual fanfare, quietly posted a paper on one of the primary open-source databases for AI research. People who experienced the intense secrecy around the effort didn’t know what to make of this change. Notably, another paper with similar results from different researchers had been posted a month earlier.
…One of the biggest secrets is the project OpenAI is working on next. Sources described it to me as the culmination of its previous four years of research: an AI system trained on images, text, and other data using massive computational resources. A small team has been assigned to the initial effort, with an expectation that other teams, along with their work, will eventually fold in. On the day it was announced at an all-company meeting, interns weren’t allowed to attend. People familiar with the plan offer an explanation: the leadership thinks this is the most promising way to reach AGI. [See DALL·E 1, CLIP.]
…The man driving OpenAI’s strategy is Dario Amodei, the ex-Googler who now serves as research director. When I meet him, he strikes me as a more anxious version of Brockman. He has a similar sincerity and sensitivity, but an air of unsettled nervous energy. He looks distant when he talks, his brows furrowed, a hand absentmindedly tugging his curls. Amodei divides the lab’s strategy into two parts. The first part, which dictates how it plans to reach advanced AI capabilities, he likens to an investor’s “portfolio of bets.” Different teams at OpenAI are playing out different bets. The language team, for example, has its money on a theory postulating that AI can develop a substantial understanding of the world through mere language learning. The robotics team, in contrast, is advancing an opposing theory that intelligence requires a physical embodiment to develop. As in an investor’s portfolio, not every bet has an equal weight. But for the purposes of scientific rigor, all should be tested before being discarded. Amodei points to GPT-2, with its remarkably realistic auto-generated texts, as an instance of why it’s important to keep an open mind. “Pure language is a direction that the field and even some of us were somewhat skeptical of”, he says. “But now it’s like, ‘Wow, this is really promising.’” Over time, as different bets rise above others, they will attract more intense efforts. Then they will cross-pollinate and combine. The goal is to have fewer and fewer teams that ultimately collapse into a single technical direction for AGI. This is the exact process that OpenAI’s latest top-secret project has supposedly already begun.
We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images [3.3m training] than the MS-COCO dataset (Lin et al 2014) and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages.
We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNetv2 (Szegedy et al 2016) for image-feature extraction and Transformer (Vaswani et al 2017) for sequence modeling achieves the best performance when trained on the Conceptual Captions dataset.
Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By restricting the self-attention mechanism to attend to local neighborhoods we significantly increase the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks. While conceptually simple, our generative models significantly outperform the current state of the art in image generation on ImageNet, improving the best published negative log-likelihood on ImageNet from 3.83 to 3.77. We also present results on image super-resolution with a large magnification ratio, applying an encoder-decoder configuration of our architecture. In a human evaluation study, we find that images generated by our super-resolution model fool human observers three times more often than the previous state of the art.
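The key trick is easy to sketch: mask self-attention so that each position attends only to a causal local window rather than the full sequence, keeping cost manageable for long pixel sequences; the window size below is an illustrative assumption:

```python
import torch

def local_causal_mask(seq_len, window=256):
    """Boolean mask: position i may attend to j only if j <= i and j > i - window."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = local_causal_mask(1024)
# In attention: scores.masked_fill_(~mask, float("-inf")) before the softmax.
```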
Learning useful representations without supervision remains a key challenge in machine learning.
In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantized-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of “posterior collapse”—where the latents are ignored when they are paired with a powerful autoregressive decoder—typically observed in the VAE framework.
Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
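A minimal sketch of the VQ bottleneck as described: snap each encoder output to its nearest codebook vector, copy gradients straight through the non-differentiable lookup, and add the codebook and commitment losses. This is my paraphrase of the method, not the authors’ code:

```python
import torch

def quantize(z_e, codebook, beta=0.25):   # z_e: (N, D) encoder outputs; codebook: (K, D)
    d = torch.cdist(z_e, codebook)        # distances to all K codes
    idx = d.argmin(dim=1)                 # nearest-neighbour lookup
    z_q = codebook[idx]
    codebook_loss = ((z_q - z_e.detach()) ** 2).mean()   # moves codes toward encoder
    commit_loss = beta * ((z_e - z_q.detach()) ** 2).mean()  # keeps encoder committed
    z_q = z_e + (z_q - z_e).detach()      # straight-through gradient copy
    return z_q, idx, codebook_loss + commit_loss
```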
Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the non-differentiable sample from a categorical distribution with a differentiable sample from a novel Gumbel-Softmax distribution. This distribution has the essential property that it can be smoothly annealed into a categorical distribution. We show that our Gumbel-Softmax estimator outperforms state-of-the-art gradient estimators on structured output prediction and unsupervised generative modeling tasks with categorical latent variables, and enables large speedups on semi-supervised classification.
The reparameterization trick enables optimizing large scale stochastic computation graphs via gradient descent. The essence of the trick is to refactor each stochastic node into a differentiable function of its parameters and a random variable with fixed distribution. After refactoring, the gradients of the loss propagated by the chain rule through the graph are low variance unbiased estimators of the gradients of the expected loss. While many continuous random variables have such reparameterizations, discrete random variables lack useful reparameterizations due to the discontinuous nature of discrete states. In this work we introduce Concrete random variables—continuous relaxations of discrete random variables. The Concrete distribution is a new family of distributions with closed form densities and a simple reparameterization. Whenever a discrete stochastic node of a computation graph can be refactored into a one-hot bit representation that is treated continuously, Concrete stochastic nodes can be used with automatic differentiation to produce low-variance biased gradients of objectives (including objectives that depend on the log-probability of latent stochastic nodes) on the corresponding discrete graph. We demonstrate the effectiveness of Concrete relaxations on density estimation and structured prediction tasks using neural networks.
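Both abstracts describe essentially the same relaxation, and it is only a few lines of code (a sketch using the standard formulation): add Gumbel noise to the logits and take a tempered softmax; as the temperature anneals toward 0, samples approach one-hot categoricals while remaining differentiable:

```python
import torch

def gumbel_softmax(logits, temperature=1.0):
    """Sample from the Gumbel-Softmax/Concrete relaxation of a categorical."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))  # Gumbel(0, 1) noise
    return torch.softmax((logits + gumbel) / temperature, dim=-1)

logits = torch.randn(4, 10, requires_grad=True)
soft = gumbel_softmax(logits, temperature=0.5)  # near one-hot, still differentiable
soft.sum().backward()                           # gradients flow back to the logits
```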