
AI/​video/​generation directory


“Long Video Generation With Time-Agnostic VQGAN and Time-Sensitive Transformer”, Ge et al 2022

“Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer”⁠, Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, Devi Parikh (2022-04-07; ; similar):

Videos are created to express emotion, exchange information, and share experiences. Video synthesis has intrigued researchers for a long time. Despite the rapid progress driven by advances in visual synthesis, most existing studies focus on improving the frames’ quality and the transitions between them, while little progress has been made in generating longer videos.

In this paper, we present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames.

Our evaluation shows that our model trained on 16-frame video clips from standard benchmarks such as UCF101⁠, Sky Time-lapse, and Taichi-HD datasets can generate diverse, coherent, and high-quality long videos. We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.

Videos and code can be found at⁠.

“Video Diffusion Models”, Ho et al 2022

“Video Diffusion Models”⁠, Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet (2022-04-07; ; similar):

Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark. Supplementary material is available at

“Reinforcement Learning With Action-Free Pre-Training from Videos”, Seo et al 2022

“Reinforcement Learning with Action-Free Pre-Training from Videos”⁠, Younggyo Seo, Kimin Lee, Stephen James, Pieter Abbeel (2022-03-25; ⁠, ):

Recent unsupervised pre-training methods have shown to be effective on language and vision domains by learning useful representations for multiple downstream tasks. In this paper, we investigate if such unsupervised pre-training methods can also be effective for vision-based reinforcement learning (RL).

To this end, we introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos. Our framework consists of two phases: we pre-train an action-free latent video prediction model, and then utilize the pre-trained representations for efficiently learning action-conditional world models on unseen environments. To incorporate additional action inputs during fine-tuning, we introduce a new architecture that stacks an action-conditional latent prediction model on top of the pre-trained action-free prediction model.

Moreover, for better exploration, we propose a video-based intrinsic bonus that leverages pre-trained representations. We demonstrate that our framework significantly improves both final performances and sample-efficiency of vision-based RL in a variety of manipulation and locomotion tasks. Code is available at Github⁠.
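The two-phase design described above (pre-train an action-free latent predictor on video alone, then stack an action-conditional model on top of the frozen predictor during fine-tuning) can be sketched in toy form; the array sizes, tanh dynamics, and additive action correction below are illustrative stand-ins, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class ActionFreeModel:
    """Phase 1: action-free latent predictor z_{t+1} = f(z_t), pre-trained on video only."""
    def __init__(self, dim):
        self.W = rng.normal(size=(dim, dim)) / np.sqrt(dim)

    def predict(self, z):
        return np.tanh(z @ self.W)

class ActionConditionalModel:
    """Phase 2: stacked on top of the frozen action-free model, it adds an
    action-dependent correction to the pre-trained prediction."""
    def __init__(self, dim, action_dim):
        self.Wa = rng.normal(size=(action_dim, dim)) / np.sqrt(action_dim)

    def predict(self, z, action, action_free):
        base = action_free.predict(z)   # representation from pre-training (frozen)
        return base + action @ self.Wa  # action-conditional refinement (fine-tuned)

dim, action_dim = 16, 4
pretrained = ActionFreeModel(dim)
finetuned = ActionConditionalModel(dim, action_dim)

z = rng.normal(size=dim)
a = np.eye(action_dim)[0]          # a one-hot action
z_next = finetuned.predict(z, a, pretrained)
```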

“Transframer: Arbitrary Frame Prediction With Generative Models”, Nash et al 2022

“Transframer: Arbitrary Frame Prediction with Generative Models”⁠, Charlie Nash, João Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, Peter Battaglia et al (2022-03-17):

We present a general-purpose framework for image modelling and vision tasks based on probabilistic frame prediction. Our approach unifies a broad range of tasks, from image segmentation⁠, to novel view synthesis and video interpolation. We pair this framework with an architecture we term Transframer, which uses U-Net and Transformer components to condition on annotated context frames, and outputs sequences of sparse, compressed image features. Transframer is the state-of-the-art on a variety of video generation benchmarks, is competitive with the strongest models on few-shot view synthesis, and can generate coherent 30 second videos from a single image without any explicit geometric information. A single generalist Transframer simultaneously produces promising results on 8 tasks, including semantic segmentation, image classification and optical flow prediction with no task-specific architectural components, demonstrating that multi-task computer vision can be tackled using probabilistic image models. Our approach can in principle be applied to a wide range of applications that require learning the conditional structure of annotated image-formatted data.

“Diffusion Probabilistic Modeling for Video Generation”, Yang et al 2022

“Diffusion Probabilistic Modeling for Video Generation”⁠, Ruihan Yang, Prakhar Srivastava, Stephan Mandt (2022-03-16; ; similar):

Denoising diffusion probabilistic models are a promising new class of generative models that are competitive with GANs on perceptual metrics.

In this paper, we explore their potential for sequentially generating video. Inspired by recent advances in neural video compression, we use denoising diffusion models to stochastically generate a residual to a deterministic next-frame prediction.

We compare this approach to two sequential VAE and two GAN baselines on four datasets, where we test the generated frames for perceptual quality and forecasting accuracy against ground truth frames. We find substantial improvements in terms of perceptual quality on all data and improvements in terms of frame forecasting for complex high-resolution videos.
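The core idea, using diffusion only for the stochastic residual left over after a deterministic next-frame prediction, can be sketched as follows; both networks are replaced by trivial stand-ins here, so this shows only the decomposition, not a working model:

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_predictor(frame):
    """Toy stand-in for the deterministic next-frame network."""
    return 0.9 * frame

def sample_residual(shape, steps=10):
    """Toy stand-in for ancestral sampling from the residual diffusion model:
    start from Gaussian noise and iteratively refine it. A real model would
    apply a learned denoiser at each step instead of simple shrinkage."""
    r = rng.normal(size=shape)
    for _ in range(steps):
        r = 0.5 * r
    return r

# training target would be: residual = true_next_frame - deterministic_predictor(frame)
# generation adds a sampled residual to the deterministic prediction:
frame = rng.normal(size=(8, 8))
next_frame = deterministic_predictor(frame) + sample_residual(frame.shape)
```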

“Microdosing: Knowledge Distillation for GAN Based Compression”, Helminger et al 2022

“Microdosing: Knowledge Distillation for GAN based Compression”⁠, Leonhard Helminger, Roberto Azevedo, Abdelaziz Djelouah, Markus Gross, Christopher Schroers (2022-01-07; ⁠, ; similar):

Recently, significant progress has been made in learned image and video compression. In particular, the use of Generative Adversarial Networks has led to impressive results in the low-bit-rate regime. However, model size remains an important issue in current state-of-the-art proposals, and existing solutions require significant computational effort on the decoding side. This limits their usage in realistic scenarios and the extension to video compression. In this paper, we demonstrate how to leverage knowledge distillation to obtain equally capable image decoders at a fraction of the original number of parameters. We investigate several aspects of our solution, including sequence specialization with side information for image coding. Finally, we also show how to transfer the obtained benefits into the setting of video compression. Overall, this allows us to reduce the model size by a factor of 20 and to achieve a 50% reduction in decoding time.

“StyleGAN-V: A Continuous Video Generator With the Price, Image Quality and Perks of StyleGAN2”, Skorokhodov et al 2021

“StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2”⁠, Ivan Skorokhodov, Sergey Tulyakov, Mohamed Elhoseiny (2021-12-29; ; similar):

Videos show continuous events, yet most—if not all—video synthesis frameworks treat them discretely in time. In this work, we think of videos as what they should be—time-continuous signals, and extend the paradigm of neural representations to build a continuous-time video generator.

For this, we first design continuous motion representations through the lens of positional embeddings. Then, we explore the question of training on very sparse videos and demonstrate that a good generator can be learned by using as few as 2 frames per clip. After that, we rethink the traditional image and video discriminators pair and propose to use a single hypernetwork-based one. This decreases the training cost and provides richer learning signal to the generator, making it possible to train directly on 1024² videos for the first time.

We build our model on top of StyleGAN2 and it is just 5% more expensive to train at the same resolution while achieving almost the same image quality. Moreover, our latent space features similar properties, enabling spatial manipulations that our method can propagate in time. We can generate arbitrarily long videos at arbitrarily high frame rates, while prior work struggles to generate even 64 frames at a fixed rate. Our model achieves state-of-the-art results on four modern 256² video synthesis benchmarks and one 1024² resolution one. Videos and the source code are available at the project website.
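The continuous motion representation via positional embeddings can be illustrated with a plain sinusoidal time embedding: because it is defined for any real-valued timestamp, the generator can be queried at any frame rate just by choosing denser timestamps. The embedding dimension and frequency schedule below are arbitrary illustrative choices, not the paper's:

```python
import numpy as np

def time_embedding(t, dim=8, max_period=100.0):
    """Sinusoidal embedding of a continuous timestamp t (any real value)."""
    freqs = max_period ** (-np.arange(dim // 2) / (dim // 2))
    angles = np.asarray(t)[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# The same 1-second clip queried at 24 fps and 120 fps -- only the
# timestamps change, not the representation:
emb_24 = time_embedding(np.arange(24) / 24)
emb_120 = time_embedding(np.arange(120) / 120)
```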

“U.S. vs. China Rivalry Boosts Tech—and Tensions: Militarized AI Threatens a New Arms Race”, Smith 2021

“U.S. vs. China Rivalry Boosts Tech—and Tensions: Militarized AI threatens a new arms race”⁠, Craig S. Smith (2021-12-28; ; similar):

…A year later, with much less fanfare, Tsinghua University’s Beijing Academy of Artificial Intelligence released an even larger model, Wu Dao 2.0⁠, with 10× as many parameters—the neural network values that encode information. While GPT-3 boasts 175 billion parameters, Wu Dao 2.0’s creators claim it has a whopping 1.75 trillion. Moreover, the model is capable not only of generating text like GPT-3 does but also images from textual descriptions like OpenAI’s 12-billion parameter DALL·E model, and has a similar scaling strategy to Google’s 1.6 trillion-parameter Switch Transformer model.

Tang Jie⁠, the Tsinghua University professor leading the Wu Dao project, said in a recent interview that the group built an even bigger, 100 trillion-parameter model in June, though it has not trained it to “convergence”, the point at which the model stops improving. “We just wanted to prove that we have the ability to do that”, Tang said…Tang says his group is now working on video with the goal of generating realistic video from text descriptions. “Hopefully, we can make this model do something beyond the Turing test”, he says, referring to an assessment of whether a computer can generate text indistinguishable from that created by a human. “That’s our final goal.”

Geoffrey Hinton instead helped to put deep learning on the map in 2012 with a now-famous neural net called AlexNet when he was at the University of Toronto. But Hinton was also in close contact with the Microsoft Research Lab in Redmond, Wash., before and after his group validated AlexNet, according to one of Hinton’s associates there, Li Deng⁠, then principal researcher and manager and later chief scientist of AI at Microsoft.

In 2009 and 2010, Hinton and Deng worked together at Microsoft on speech recognition and Deng, then Editor-In-Chief of the IEEE Signal Processing Magazine, was invited in 2011 to lecture at several academic organizations in China where he said he shared the published success of deep learning in speech processing. Deng said he was in close contact with former Microsoft colleagues at Baidu⁠, a Chinese search engine and AI giant, and a company called iFlyTek⁠, a spin off from Deng’s undergraduate alma mater.

When Hinton achieved his breakthrough with backpropagation in neural networks in 2012, he sent an email to Deng in Washington, and Deng said he shared it with Microsoft executives, including Qi Lu who led the development of the company’s search engine, Bing⁠. Deng said he also sent a note to his friends at iFlyTek, which quickly adopted the strategy and became an AI powerhouse—famously demonstrated in 2017 with a convincing video of then-president Donald Trump speaking Chinese⁠.

Qi Lu went on to become COO of Baidu where Deng said another Microsoft alum, Kai Yu⁠, who also knew Hinton well, had already seized on Hinton’s breakthrough. Literally within hours of Hinton’s results, according to Deng, researchers in China were working on repeating his success.

“Advances in Neural Rendering”, Tewari et al 2021

“Advances in Neural Rendering”⁠, Ayush Tewari, Justus Thies, Ben Mildenhall, Pratul Srinivasan, Edgar Tretschk, Yifan Wang, Christoph Lassner et al (2021-11-10):

Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take specifically defined representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanied textures (eg. created by an artist), point clouds (eg. from a depth sensor), volumetric grids (eg. from a CT scan), or implicit surface functions (eg. truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related, and combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward towards the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, often now referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects…

“Learning a Perceptual Manifold With Deep Features for Animation Video Resequencing”, Morace et al 2021

“Learning a perceptual manifold with deep features for animation video resequencing”⁠, Charles C. Morace, Thi-Ngoc-Hanh Le, Sheng-Yi Yao, Shang-Wei Zhang, Tong-Yee Lee (2021-11-02; ; similar):

We propose a novel deep learning framework for animation video resequencing. Our system produces new video sequences by minimizing a perceptual distance of images from an existing animation video clip. To measure perceptual distance, we utilize the activations of convolutional neural networks and learn a perceptual distance by training these features on a small network with data comprised of human perceptual judgments. We show that with this perceptual metric and graph-based manifold learning techniques, our framework can produce new smooth and visually appealing animation video results for a variety of animation video styles. In contrast to previous work on animation video resequencing, the proposed framework applies to a wide range of image styles and does not require hand-crafted feature extraction, background subtraction, or feature correspondence. In addition, we also show that our framework has applications to appealingly arranging unordered collections of images.

“Autoregressive Latent Video Prediction With High-Fidelity Image Generator”, Seo et al 2021

“Autoregressive Latent Video Prediction with High-Fidelity Image Generator”⁠, Younggyo Seo, Kimin Lee, Fangchen Liu, Stephen James, Pieter Abbeel (2021-10-05; similar):

Video prediction is an important yet challenging problem, burdened with the tasks of generating future frames and learning environment dynamics. Recently, autoregressive latent video models have proved to be a powerful video prediction tool, by separating the video prediction into two sub-problems: pre-training an image generator model, followed by learning an autoregressive prediction model in the latent space of the image generator. However, successfully generating high-fidelity and high-resolution videos has yet to be seen. In this work, we investigate how to train an autoregressive latent video prediction model capable of predicting high-fidelity future frames with minimal modification to existing models, and produce high-resolution (256×256) videos. Specifically, we scale up prior models by employing a high-fidelity image generator (VQ-GAN) with a causal transformer model, and introduce additional techniques of top-k sampling and data augmentation to further improve video prediction quality. Despite the simplicity, the proposed method achieves competitive performance to state-of-the-art approaches on standard video prediction benchmarks with fewer parameters, and enables high-resolution video prediction on complex and large-scale datasets. Videos are available at the anonymized website

[Keywords: video prediction, autoregressive models]
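Of the techniques mentioned, top-k sampling is easy to sketch: at each generation step, the transformer's next-token distribution is truncated to the k most likely latent codes before sampling. The toy logits below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(logits, k):
    """Sample a token id, restricted to the k highest-probability candidates."""
    top = np.argpartition(logits, -k)[-k:]       # indices of the k largest logits
    p = np.exp(logits[top] - logits[top].max())  # softmax over the survivors
    p /= p.sum()
    return int(rng.choice(top, p=p))

logits = np.array([0.1, 2.0, -1.0, 3.0, 0.5])
token = top_k_sample(logits, k=2)  # only ids 1 and 3 can be drawn
```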

“FitVid: Overfitting in Pixel-Level Video Prediction”, Babaeizadeh et al 2021

“FitVid: Overfitting in Pixel-Level Video Prediction”⁠, Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, Dumitru Erhan (2021-06-24; ):

An agent that is capable of predicting what happens next can perform a variety of tasks through planning with no additional training. Furthermore, such an agent can internally represent the complex dynamics of the real world and therefore can acquire a representation useful for a variety of visual perception tasks. This makes predicting the future frames of a video, conditioned on the observed past and potentially future actions, an interesting task which remains exceptionally challenging despite many recent advances. Existing video prediction models have shown promising results on simple narrow benchmarks, but they generate low-quality predictions on real-life datasets with more complicated dynamics or broader domains. There is a growing body of evidence that underfitting on the training data is one of the primary causes of low-quality predictions. In this paper, we argue that the inefficient use of parameters in current video models is the main reason for underfitting. Therefore, we introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks while having a similar parameter count to the current state-of-the-art models. We analyze the consequences of overfitting, illustrating how it can produce unexpected outcomes such as generating high quality output by repeating the training data, and how it can be mitigated using existing image augmentation techniques. As a result, FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.

“Alias-Free Generative Adversarial Networks”, Karras et al 2021

“Alias-Free Generative Adversarial Networks”⁠, Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, Timo Aila (2021-06-23; ; backlinks; similar):

[Github] We observe that despite their hierarchical convolutional nature, the synthesis process of typical generative adversarial networks depends on absolute pixel coordinates in an unhealthy manner. This manifests itself as, eg. detail appearing to be glued to image coordinates instead of the surfaces of depicted objects.

We trace the root cause to careless signal processing that causes aliasing in the generator network. Interpreting all signals in the network as continuous, we derive generally applicable, small architectural changes that guarantee that unwanted information cannot leak into the hierarchical synthesis process.

The resulting networks match the FID of StyleGAN2 but differ dramatically in their internal representations, and they are fully equivariant to translation and rotation even at subpixel scales. Our results pave the way for generative models better suited for video and animation.

G. Energy Consumption: The entire project consumed ~225 megawatt-hours (MWh) of electricity. ~70% of it was used for exploratory runs, where we gradually built the new configurations, first in an unstructured manner and then specifically ironing out the new Alias-Free-T and Alias-Free-R configurations. Setting up the intermediate configurations between StyleGAN2 and our generators, as well as the key parameter ablations, was also quite expensive at ~15%. Training a single instance of Alias-Free-R at 1024×1024 is only slightly more expensive (0.9 MWh) than training StyleGAN2 (0.7 MWh).

Table 17: Computational effort expenditure and electricity consumption data for this project. The unit for computation is GPU-years on a single NVIDIA V100 GPU—it would have taken ~92 years to execute this project using a single GPU. See the text for additional details about the computation and energy consumption estimates. Early exploration includes early training runs that affected our decision to start this project. Project exploration includes training runs that were done specifically for this project, leading to the final Alias-Free-T and Alias-Free-R configurations. These runs were not intended to be used in the paper as-is. Setting up ablations includes hyperparameter tuning for the intermediate configurations and ablation experiments in Figures 3 and 5. Per-dataset tuning includes hyperparameter tuning for individual datasets, mainly the grid search for R1 regularization weight. Config R at 1024×1024 corresponds to one training run in Figure 5, left, and Other runs in the dataset table includes the remaining runs. Ablation tables includes the low-resolution ablations in Figures 3 and 5. Results intentionally left out includes additional results that were initially planned, but then left out to improve focus and clarity.
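As a rough consistency check on the figures quoted above (~225 MWh total over ~92 V100 GPU-years), the implied average draw per GPU comes out close to a V100's ~300 W board power, which suggests the two reported numbers are mutually consistent:

```python
# Both inputs are taken from the text above; the arithmetic is the only addition.
gpu_years = 92
total_mwh = 225

gpu_hours = gpu_years * 365 * 24                 # ~805,920 GPU-hours
avg_kw_per_gpu = total_mwh * 1000 / gpu_hours    # average power per GPU in kW
print(round(avg_kw_per_gpu * 1000))              # prints 279 (watts per GPU)
```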

“GANs N’ Roses: Stable, Controllable, Diverse Image to Image Translation (works for Videos Too!)”, Chong & Forsyth 2021

“GANs N’ Roses: Stable, Controllable, Diverse Image to Image Translation (works for videos too!)”⁠, Min Jin Chong, David Forsyth (2021-06-11; ⁠, ; similar):

We show how to learn a map that takes a content code, derived from a face image, and a randomly chosen style code to an anime image. We derive an adversarial loss from our simple and effective definitions of style and content. This adversarial loss guarantees the map is diverse—a very wide range of anime can be produced from a single content code. Under plausible assumptions, the map is not just diverse, but also correctly represents the probability of an anime, conditioned on an input face. In contrast, current multimodal generation procedures cannot capture the complex styles that appear in anime. Extensive quantitative experiments support the idea that the map is correct. Extensive qualitative results show that the method can generate a much more diverse range of styles than SOTA comparisons. Finally, we show that our formalization of content and style allows us to perform video to video translation without ever training on videos.

“Vector Quantized Models for Planning”, Ozair et al 2021

“Vector Quantized Models for Planning”⁠, Sherjil Ozair, Yazhe Li, Ali Razavi, Ioannis Antonoglou, Aäron van den Oord, Oriol Vinyals (2021-06-08; ⁠, ; similar):

Recent developments in the field of model-based RL have proven successful in a range of environments, especially ones where planning is essential. However, such successes have been limited to deterministic fully-observed environments. We present a new approach that handles stochastic and partially-observable environments. Our key insight is to use discrete autoencoders to capture the multiple possible effects of an action in a stochastic environment. We use a stochastic variant of Monte Carlo tree search to plan over both the agent’s actions and the discrete latent variables representing the environment’s response. Our approach significantly outperforms an offline version of MuZero on a stochastic interpretation of chess where the opponent is considered part of the environment. We also show that our approach scales to DeepMind Lab⁠, a first-person 3D environment with large visual observations and partial observability.

“NWT: Towards Natural Audio-to-video Generation With Representation Learning”, Mama et al 2021

“NWT: Towards natural audio-to-video generation with representation learning”⁠, Rayhane Mama, Marc S. Tyndel, Hashiam Kadhim, Cole Clifford, Ragavan Thurairatnam (2021-06-08):

In this work we introduce NWT, an expressive speech-to-video model. Unlike approaches that use domain-specific intermediate representations such as pose keypoints, NWT learns its own latent representations, with minimal assumptions about the audio and video content. To this end, we propose a novel discrete variational autoencoder with adversarial loss, dVAE-Adv, which learns a new discrete latent representation we call Memcodes. Memcodes are straightforward to implement, require no additional loss terms, are stable to train compared with other approaches, and show evidence of interpretability. To predict on the Memcode space, we use an autoregressive encoder-decoder model conditioned on audio. Additionally, our model can control latent attributes in the generated video that are not annotated in the data. We train NWT on clips from HBO’s Last Week Tonight with John Oliver. NWT consistently scores above other approaches in Mean Opinion Score (MOS) on tests of overall video naturalness, facial naturalness and expressiveness, and lipsync quality. This work sets a strong baseline for generalized audio-to-video synthesis. Samples are available at⁠.

“GODIVA: Generating Open-DomaIn Videos from NAtural Descriptions”, Wu et al 2021

“GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions”⁠, Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, Nan Duan (2021-04-30; ; similar):

Generating videos from text is a challenging task due to its high computational requirements for training and infinite possible answers for evaluation. Existing works typically experiment on simple or small datasets, where the generalization ability is quite limited.

In this work, we propose GODIVA, an open-domain text-to-video pretrained model that can generate videos from text in an auto-regressive manner using a three-dimensional sparse attention mechanism. We pretrain our model on HowTo100M, a large-scale text-video dataset that contains more than 136 million text-video pairs. Experiments show that GODIVA not only can be fine-tuned on downstream video generation tasks, but also has good zero-shot capability on unseen texts.

We also propose a new metric called Relative Matching (RM) to automatically evaluate the video generation quality. Several challenges are listed and discussed as future work.

“VideoGPT: Video Generation Using VQ-VAE and Transformers”, Yan et al 2021

“VideoGPT: Video Generation using VQ-VAE and Transformers”⁠, Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas (2021-04-20; ; backlinks; similar):

We present VideoGPT: a conceptually simple architecture for scaling likelihood based generative modeling to natural videos.

VideoGPT uses VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings.

Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high-fidelity natural images from UCF101 and the Tumblr GIF dataset (TGIF).

We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer based video generation models. Samples and code are available at
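The overall data flow (a VQ-VAE compresses the video into a small grid of discrete codes, which a GPT-style model then factorizes autoregressively using spatio-temporal position encodings) can be sketched with made-up sizes; the grid dimensions and codebook size below are illustrative, not VideoGPT's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the VQ-VAE has downsampled a short clip to a 4x16x16 grid of
# discrete codes drawn from a 1024-entry codebook (sizes are made up):
T, H, W, vocab = 4, 16, 16, 1024
latents = rng.integers(0, vocab, size=(T, H, W))

# The transformer consumes the codes as one flat sequence, with every
# position tagged by its (t, h, w) coordinates for the position encoding:
tokens = latents.reshape(-1)
coords = np.stack(np.meshgrid(np.arange(T), np.arange(H), np.arange(W),
                              indexing="ij"), axis=-1).reshape(-1, 3)

# The GPT-like model then learns the autoregressive factorization
# p(tokens) = prod_i p(tokens[i] | tokens[:i], coords).
```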

“China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) Releases Wu Dao 1.0, China’s First Large-scale Pretraining Model.”, Synced 2021

“China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) releases Wu Dao 1.0, China’s first large-scale pretraining model.”⁠, Synced (2021-03-23; ⁠, ⁠, ; backlinks; similar):

[Fun note: the corpus uses The Pile⁠.] In a bid to promote the research and development of China’s own large-scale pretraining models and further explore universal intelligence from a more fundamental perspective, the Beijing Academy of Artificial Intelligence (BAAI) recently unveiled Wu Dao 1.0, China’s first homegrown super-scale intelligent model system. The work was led by BAAI Research Academic Vice President and Tsinghua University Professor Tang Jie, with contributions from a team of more than 100 AI scientists from Peking University, Tsinghua University, Renmin University of China, Chinese Academy of Sciences and other institutes.

Wu Dao 1.0 has initiated large-scale research projects via 4 related models: Wu Dao—Wen Yuan, Wu Dao—Wen Lan, Wu Dao—Wen Hui, and Wu Dao—Wen Su.

  1. Wu Dao—Wen Yuan: is China’s largest-ever pretraining language model, boasting the best processing power in mainstream languages, including Chinese and English. It has surpassed average human performance benchmarks on text categorization, sentiment analysis, natural language inference, reading comprehension and more. The Wu Dao—Wen Yuan project is designed to explore universal natural language understanding (NLU) techniques and study brain-inspired language models. It has 2.6 billion parameters and is capable of performing cognitive activities such as memorization, comprehension, retrieval, numerical calculation, multi-language, etc. Wu Dao—Wen Yuan has achieved GPT-3 comparable performance on 20 Chinese NLP tasks such as open-domain answering, grammar correction, sentiment analysis, etc.

    …Wen Yuan introduces the open-source Chinese pretraining model (CPM). Based on CPM, the CPM-Distill model reduces language confusion by 38% and achieves better results on downstream tasks.

  2. Wu Dao—Wen Lan: meanwhile, is the first publicly available Chinese universal graphic multimodal pretraining model. The ultra-large-scale multimodal pretraining model aims to break through the theoretical challenges of pretraining multimodal data based on a combination of graphics, text and video, and eventually generate industrial-grade Chinese graphics pretraining models and applications that exceed SOTA performance. Currently, the model has 1 billion parameters and is trained on 50 million graphic pairs collected from open sources. The Wu Dao—Wen Lan model has reached SOTA performance, scoring 5% higher than the champion team on the Image Caption task on the Chinese public multimodal test set AIC-ICC and 20% higher than the most popular UNITER model on the Visual Entailment task.

    …Wen Lan is the first Chinese generic multimodal pretraining model that can understand “connotative information” based on weak correlations of images and text. Wen Lan uses an advanced cross-modal contrast learning algorithm: Given an image-text pair, it can enlarge the number of negative samples for each modal, especially for those which are difficult to distinguish, further improving the expression ability of neural networks. It can easily replace image and text encoders with the most advanced single-mode pretraining model, achieving 20× faster performance than the UNITER model.

  3. Wu Dao—Wen Hui: is an ultra-large-scale cognitive-oriented pretraining model that focuses on a series of essential problems in general artificial intelligence from a cognitive perspective, aiming to develop and enhance the logic/​consciousness/​reasoning-based cognitive capabilities of pretraining models. Wu Dao—Wen Hui has reached 11.3 billion parameters, and through simple fine-tuning can generate poetry, make videos, draw pictures, retrieve text, perform complex reasoning, etc. BAAI says the model achieves near-human performance on poetry generation on the Turing test.

    …Wen Hui proposes a new pretraining paradigm, the Generative Language Model, breaking the bottlenecks of BERT and GPT. For the first time, a single model has achieved the best results in both language understanding and generation tasks, surpassing common pretraining models such as BERT, RoBERTa and T5 trained on the same volume of data. Wen Hui’s continuous-vector-based fine-tuning method, P-tuning⁠, makes it the first autoregressive model to surpass autoencoder models on NLU tasks, and it has achieved SOTA results on more than 10 tasks such as Knowledge Extraction and SuperGLUE Few-shot Learning, with over 20% performance improvement. Wen Hui’s inverse-prompting algorithm achieves close to human performance on Q&A and poetry generation, and it is the first model that can generate classical Chinese poetry from modern themes.

  4. Wu Dao—Wen Su: is a large-scale training model for biomolecular structure prediction. It can handle super long biomolecular structures, where it has achieved SOTA performance, interpretability and robustness. Based on Google’s BERT language model, Wu Dao—Wen Su has completed protein training on the 100 GB UNIPARC database and gene training on 5–100,000 human peripheral blood immune cells (25–30 cell types) and 10,000 drug-resistant bacteria.

    …Wen Su’s open-sourced FastMoE is the first high-performance MoE (Mixture-of-Experts Model) system that supports the PyTorch framework and a variety of hardware. Only one line of code is required to complete the MoE transformation, and model training speed is increased by 47× compared with the traditional PyTorch implementation.
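The routing idea behind a mixture-of-experts layer fits in a few lines of plain Python. This is a schematic of top-1 expert routing only, not FastMoE’s actual API; the expert functions and gate weights below are made up for illustration:

```python
def moe_forward(x, experts, gate_weights):
    # Gate: one linear score per expert; route the input to the top-1 expert.
    scores = [sum(w * xi for w, xi in zip(gw, x)) for gw in gate_weights]
    k = max(range(len(experts)), key=scores.__getitem__)
    return experts[k](x), k

experts = [lambda v: [2 * vi for vi in v],   # expert 0: doubles its input
           lambda v: [vi + 1 for vi in v]]   # expert 1: increments its input
gates = [[1.0, 0.0],                         # expert 0 scores on feature 0
         [0.0, 1.0]]                         # expert 1 scores on feature 1
out, chosen = moe_forward([3.0, 0.0], experts, gates)
assert chosen == 0 and out == [6.0, 0.0]
```

In a real MoE Transformer each token is routed independently, so experts can be sharded across devices; FastMoE’s contribution is making that routing and the resulting all-to-all communication fast under PyTorch.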

“Clockwork Variational Autoencoders”, Saxena et al 2021

“Clockwork Variational Autoencoders”⁠, Vaibhav Saxena, Jimmy Ba, Danijar Hafner (2021-02-18; similar):

Deep learning has enabled algorithms to generate realistic images. However, accurately predicting long video sequences requires understanding long-term dependencies and remains an open challenge. While existing video prediction models succeed at generating sharp images, they tend to fail at accurately predicting far into the future. We introduce the Clockwork VAE (CW-VAE), a video prediction model that leverages a hierarchy of latent sequences, where higher levels tick at slower intervals. We demonstrate the benefits of both hierarchical latents and temporal abstraction on 4 diverse video prediction datasets with sequences of up to 1000 frames, where CW-VAE outperforms top video prediction models. Additionally, we propose a Minecraft benchmark for long-term video prediction. We conduct several experiments to gain insights into CW-VAE and confirm that slower levels learn to represent objects that change more slowly in the video, and faster levels learn to represent faster objects.
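The temporal-abstraction scheme is easy to sketch: each latent level updates only every `stride**level` steps. The stride of 4 here is illustrative; CW-VAE tunes the abstraction factor per dataset, and the real model of course updates latent states rather than merely listing timesteps:

```python
def clockwork_ticks(num_levels, horizon, stride=4):
    """Timesteps at which each latent level updates; level l ticks every stride**l steps."""
    return {level: [t for t in range(horizon) if t % stride ** level == 0]
            for level in range(num_levels)}

ticks = clockwork_ticks(num_levels=3, horizon=16)
assert ticks[0] == list(range(16))   # bottom level: every frame
assert ticks[1] == [0, 4, 8, 12]     # middle level: every 4th frame
assert ticks[2] == [0]               # top level: once per 16 frames
```

Because the top level performs far fewer updates, it is both cheap and forced to carry only slowly-changing content, which is the mechanism behind the long-horizon results.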

“Scaling Laws for Autoregressive Generative Modeling”, Henighan et al 2020

“Scaling Laws for Autoregressive Generative Modeling”⁠, Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown et al (2020-10-28; ⁠, ; backlinks; similar):

We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image ↔︎ text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains.

The cross-entropy loss has an information-theoretic interpretation as S(True) + D_KL(True‖Model), and the empirical scaling laws suggest a prediction for both the true data distribution’s entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an 8×8 resolution, and we can forecast the model size needed to achieve any given reducible loss (ie. D_KL) in nats/​image for other resolutions.

We find a number of additional scaling laws in specific domains: (1) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question “Is a picture worth a thousand words?”; (2) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (3) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.

…As we increase model and dataset sizes, optimization becomes increasingly efficient, until eventually learning curves begin to merge with the L(D) trend, so that there are no benefits to be gained from training for more than a single epoch [Komatsuzaki 2019].

…We have argued that a single neural architecture, the Transformer, can be applied to the generative modeling of images, videos, multimodal data, and math, along with language [Kaplan et al 2020⁠, Brown et al 2020]. We identified common scaling laws for the loss achieved on all data modalities as a function of both model size and compute budget. As in the case of language, these results imply that larger models become more sample-efficient. Furthermore, we found that in some important cases, fine-tuned performance on downstream tasks also follows similar scaling laws. This suggests that trends in the generative modeling loss translate into advantages in practical capabilities.

A greater surprise was the universal trend (figure 2) for optimal model size as a function of the training compute budget—we did not anticipate that the exponent in N_opt ∝ C^0.7 would be largely independent of the data distribution. This trend implies a dual trend for the number of tokens elapsed during optimized training, as a function of C or N, and leads to the conclusion that larger compute budgets should be “spent” mostly on larger models, rather than much longer training runs. So this lesson from language modeling [Kaplan et al 2020] generalizes. These empirical regularities beg for theoretical explanation—why do these scaling relations hold? The scaling laws also suggest a shift in perspective away from the particularities of neural architectures, loss functions⁠, and training algorithms and towards the broader commonalities that appear when machine learning is studied across a large hierarchy of model, data, and compute scales. Work in ML often involves identifying specific deficiencies in current capabilities and remedying them through the alteration of models and algorithms. Perhaps many capabilities simply lie on a spectrum that can be continuously unlocked through increasing scale, as might be suggested by the meta-learning capabilities of the GPT-3 model [Brown et al 2020].
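The “power-law plus constant” form is simple enough to state as code (symbols follow the paper’s equation (1.1); the fitted constants below are placeholders for illustration, not values from the paper):

```python
def scaling_loss(C, L_inf, C0, alpha):
    """L(C) = L_inf + (C0 / C)**alpha: irreducible entropy plus a reducible KL term."""
    return L_inf + (C0 / C) ** alpha

# The reducible loss L - L_inf is a pure power law (a straight line on log-log axes):
# with alpha = 0.5, every 100x of extra compute cuts the reducible loss by 10x.
r1 = scaling_loss(1e3, L_inf=2.0, C0=1.0, alpha=0.5) - 2.0
r2 = scaling_loss(1e5, L_inf=2.0, C0=1.0, alpha=0.5) - 2.0
assert abs(r1 / r2 - 10.0) < 1e-9
```

The per-domain fits in Table 1 amount to estimating (L_inf, C0, alpha) from training runs at many scales; L_inf is then read as the data distribution’s entropy and the remainder as the model’s distance from it.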

Figure 1: Smooth scaling of reducible loss across domains—We show power-law scaling laws for the reducible loss L−L∞ as a function of compute, where the irreducible loss L∞ is a fitted domain-dependent constant. Under plausible assumptions concerning the infinite data and compute limits, the irreducible loss estimates the entropy of the underlying data distribution, while the reducible loss approximates the KL divergence between the data and model distributions. In the case of language we use results from [BMR+20], and only show the full loss L.
Table 1: Summary of scaling laws—In this table we summarize the model size and compute scaling fits to equation (1.1) along with N_opt(C), with the loss in nats/​token, and compute measured in petaflop-days. In most cases the irreducible losses match quite well between model size and compute scaling laws. The math compute scaling law may be affected by the use of weight decay, which typically hurts performance early in training and improves performance late in training. The compute scaling results and data for language are from [BMR+20], while N_opt(C) comes from [KMH+20]. Unfortunately, even with data from the largest language models we cannot yet obtain a meaningful estimate for the entropy of natural language.
Figure 2: Optimal model size is consistent across domains—We display the optimal model size N_opt as a function of the training compute budget C. Not only does N_opt(C) behave as a power-law, but the behavior is remarkably similar for all data modalities.
Figure 31: Q&A—We show the progression of simple Q&A capabilities of GPT-3 family models as we increase the parameter count [BMR+20]. We ask the model who the first and second president of the United States was. · Tiny models appear to have trouble understanding the question, and don’t place any substantial probability on the correct answer. Larger models understand that we’re requesting a US president, but fail to understand that the “second president” and “first president” are different requests, placing most of their weight for both questions on “George Washington”. Only larger models understand both aspects of the questions, answering both correctly.

[See also: Figure 3 & Figure 11⁠.]

“Implicit Neural Representations With Periodic Activation Functions”, Sitzmann et al 2020

“Implicit Neural Representations with Periodic Activation Functions”⁠, Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, Gordon Wetzstein (2020-06-17; ; backlinks; similar):

Implicitly defined, continuous, differentiable signal representations parameterized by neural networks have emerged as a powerful paradigm, offering many possible benefits over conventional representations. However, current network architectures for such implicit neural representations are incapable of modeling signals with fine detail, and fail to represent a signal’s spatial and temporal derivatives, despite the fact that these are essential to many physical signals defined implicitly as the solution to partial differential equations.

We propose to leverage periodic activation functions for implicit neural representations and demonstrate that these networks, dubbed sinusoidal representation networks or Sirens, are ideally suited for representing complex natural signals and their derivatives. We analyze Siren activation statistics to propose a principled initialization scheme and demonstrate the representation of images, wavefields, video, sound, and their derivatives. Further, we show how Sirens can be leveraged to solve challenging boundary value problems, such as particular Eikonal equations (yielding signed distance functions), the Poisson equation, and the Helmholtz and wave equations. Lastly, we combine Sirens with hypernetworks to learn priors over the space of Siren functions.
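A Siren layer and its principled initialization fit in a few lines of plain Python (forward pass only, biases omitted for brevity; ω₀ = 30 is the paper’s default frequency scale):

```python
import math, random

W0 = 30.0  # omega_0, the frequency scale

def siren_init(fan_in, fan_out, first, rng):
    # First layer: U(-1/n, 1/n); later layers: U(-sqrt(6/n)/omega_0, +sqrt(6/n)/omega_0),
    # where n = fan_in -- the paper's principled initialization scheme.
    bound = 1.0 / fan_in if first else math.sqrt(6.0 / fan_in) / W0
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]

def siren_layer(x, W):
    # y_j = sin(omega_0 * <W_j, x>)
    return [math.sin(W0 * sum(w * xi for w, xi in zip(row, x))) for row in W]

rng = random.Random(0)
W1 = siren_init(2, 16, first=True, rng=rng)    # input: a 2D pixel coordinate
W2 = siren_init(16, 16, first=False, rng=rng)
h = siren_layer(siren_layer([0.3, -0.7], W1), W2)
assert len(h) == 16 and all(-1.0 <= v <= 1.0 for v in h)
```

The initialization keeps pre-activations distributed so that sin() neither saturates nor collapses with depth, which is why Sirens train stably where naive sine networks do not.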

“NeRF: Representing Scenes As Neural Radiance Fields for View Synthesis”, Mildenhall et al 2020

“NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”⁠, Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng (2020-03-19; ; backlinks; similar):

We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x,y,z) and viewing direction (Θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
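The volume-rendering quadrature at the core of NeRF is compact (a single color channel for brevity; the density and color values below are made up for illustration):

```python
import math

def render_ray(sigmas, colors, deltas):
    """C = sum_i T_i * alpha_i * c_i, with alpha_i = 1 - exp(-sigma_i * delta_i)
    and transmittance T_i = prod_{j<i} (1 - alpha_j)."""
    C, T = 0.0, 1.0
    for sigma, c, d in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * d)
        C += T * alpha * c
        T *= 1.0 - alpha
    return C

# A nearly opaque first sample (color 1.0) hides the sample behind it (color 0.0).
assert render_ray([50.0, 50.0], [1.0, 0.0], [0.1, 0.1]) > 0.99
# Empty space (sigma = 0) contributes nothing.
assert render_ray([0.0], [1.0], [1.0]) == 0.0
```

Because every operation here is differentiable, gradients flow from the rendered pixel color back to the network producing σ and c—which is why only posed images are needed as supervision.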

“High Fidelity Video Prediction With Large Stochastic Recurrent Neural Networks”, Villegas et al 2019

“High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks”⁠, Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V. Le, Honglak Lee (2019-11-05; ⁠, ; backlinks; similar):

Predicting future video frames is extremely challenging, as there are many factors of variation that make up the dynamics of how frames change through time. Previously proposed solutions require complex inductive biases inside network architectures with highly specialized computation, including segmentation masks, optical flow, and foreground and background separation. In this work, we question if such handcrafted architectures are necessary and instead propose a different approach: finding minimal inductive bias for video prediction while maximizing network capacity. We investigate this question by performing the first large-scale empirical study and demonstrate state-of-the-art performance by learning large models on three different datasets: one for modeling object interactions, one for modeling human motion, and one for modeling car driving.

“Learning to Predict Without Looking Ahead: World Models Without Forward Prediction”, Freeman et al 2019

“Learning to Predict Without Looking Ahead: World Models Without Forward Prediction”⁠, C. Daniel Freeman, Luke Metz, David Ha (2019-10-29; ; backlinks; similar):

[HTML version of Freeman et al 2019⁠, with videos.]

Much of model-based reinforcement learning involves learning a model of an agent’s world, and training an agent to leverage this model to perform a task more efficiently. While these models are demonstrably useful for agents, every naturally occurring model of the world of which we are aware—eg. a brain—arose as the byproduct of competing evolutionary pressures for survival, not minimization of a supervised forward-predictive loss via gradient descent. That useful models can arise out of the messy and slow optimization process of evolution suggests that forward-predictive modeling can arise as a side-effect of optimization under the right circumstances. Crucially, this optimization process need not explicitly be a forward-predictive loss. In this work, we introduce a modification to traditional reinforcement learning which we call observational dropout, whereby we limit the agent’s ability to observe the real environment at each timestep. In doing so, we can coerce an agent into learning a world model to fill in the observation gaps during reinforcement learning. We show that the emerged world model, while not explicitly trained to predict the future, can help the agent learn key skills required to perform well in its environment.

[Image caption: “Our agents are only given infrequent observations of the real environment. As a side effect of optimizing performance in this setting, a ‘world model’ emerges. We show the true dynamics in color, with full saturation denoting frames the policy can see. The black-and-white outline shows the state of the emergent world model. These world models exhibit similar, but not identical, dynamics to forward-predictive models, but model only the ‘important’ aspects of the environment.”]

“Learning to Predict Without Looking Ahead: World Models Without Forward Prediction”, Freeman et al 2019

“Learning to Predict Without Looking Ahead: World Models Without Forward Prediction”⁠, C. Daniel Freeman, Luke Metz, David Ha (2019-10-29; ; backlinks; similar):

Much of model-based reinforcement learning involves learning a model of an agent’s world, and training an agent to leverage this model to perform a task more efficiently. While these models are demonstrably useful for agents, every naturally occurring model of the world of which we are aware—eg. a brain—arose as the byproduct of competing evolutionary pressures for survival, not minimization of a supervised forward-predictive loss via gradient descent. That useful models can arise out of the messy and slow optimization process of evolution suggests that forward-predictive modeling can arise as a side-effect of optimization under the right circumstances. Crucially, this optimization process need not explicitly be a forward-predictive loss. In this work, we introduce a modification to traditional reinforcement learning which we call observational dropout, whereby we limit the agent’s ability to observe the real environment at each timestep. In doing so, we can coerce an agent into learning a world model to fill in the observation gaps during reinforcement learning. We show that the emerged world model, while not explicitly trained to predict the future, can help the agent learn key skills required to perform well in its environment. Videos of our results available at https://#google

“Scaling Autoregressive Video Models”, Weissenborn et al 2019

“Scaling Autoregressive Video Models”⁠, Dirk Weissenborn, Oscar Täckström, Jakob Uszkoreit (2019-06-06; ; backlinks; similar):

Due to the statistical complexity of video, the high degree of inherent stochasticity, and the sheer amount of data, generating natural video remains a challenging task. State-of-the-art video generation models often attempt to address these issues by combining sometimes complex, usually video-specific neural network architectures, latent variable models, adversarial training and a range of other methods. Despite their often high complexity, these approaches still fall short of generating high quality video continuations outside of narrow domains and often struggle with fidelity. In contrast, we show that conceptually simple autoregressive video generation models based on a three-dimensional self-attention mechanism achieve competitive results across multiple metrics on popular benchmark datasets, for which they produce continuations of high fidelity and realism. We also present results from training our models on Kinetics, a large-scale action recognition dataset comprised of YouTube videos exhibiting phenomena such as camera movement, complex object interactions, and diverse human movement. While modeling these phenomena consistently remains elusive, we hope that our results, which include occasional realistic continuations, encourage further research on comparatively complex, large-scale datasets such as Kinetics.
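The autoregressive factorization over video amounts to a causality constraint on flattened (t, h, w) tokens; a raster-order attention mask expresses it directly. This is a schematic of the constraint only—the paper uses block-local and subscaled attention to make the full 3D mask tractable:

```python
def causal_video_mask(T, H, W):
    # Token i may attend to token j iff j precedes i (or j == i) in raster order
    # over the flattened (t, h, w) grid.
    n = T * H * W
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_video_mask(2, 2, 2)   # 8 space-time tokens
assert all(mask[i][i] for i in range(8))   # attending to itself is allowed
assert mask[7][0] and not mask[0][7]       # no attending to future tokens
```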

“Model-Based Reinforcement Learning for Atari”, Kaiser et al 2019

“Model-Based Reinforcement Learning for Atari”⁠, Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H. Campbell, Konrad Czechowski et al (2019-03-01; similar):

Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction—substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models, and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games in the low-data regime of 100k interactions between the agent and the environment, which corresponds to two hours of real-time play. In most games SimPLe outperforms state-of-the-art model-free algorithms, in some games by over an order of magnitude.

“Parallel Multiscale Autoregressive Density Estimation”, Reed et al 2017

“Parallel Multiscale Autoregressive Density Estimation”⁠, Scott Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gómez Colmenarejo, Ziyu Wang, Dan Belov, Nando de Freitas et al (2017-03-10; ⁠, ; similar):

PixelCNN achieves state-of-the-art results in density estimation for natural images. Although training is fast, inference is costly, requiring one network evaluation per pixel; 𝑂(N) for N pixels. This can be sped up by caching activations, but still involves generating each pixel sequentially.

In this work, we propose a parallelized PixelCNN that allows more efficient inference by modeling certain pixel groups as conditionally independent. Our new PixelCNN model achieves competitive density estimation and orders of magnitude speedup—𝑂(log N) sampling instead of 𝑂(N)—enabling the practical generation of 512×512 images.
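The O(log N) claim follows from the sampling schedule: image resolution doubles on each parallel pass, with every pixel in the new group sampled simultaneously conditioned on the coarser image. A schematic of the counting argument (not the model itself):

```python
def sampling_passes(n):
    # Grow a 1x1 image to n x n by doubling the side length each parallel pass;
    # all pixels introduced in a pass are sampled simultaneously.
    passes, side = 0, 1
    while side < n:
        side *= 2
        passes += 1
    return passes

# 9 network passes for a 512x512 image, vs. 512 * 512 = 262,144 sequential
# pixel evaluations for a fully autoregressive PixelCNN.
assert sampling_passes(512) == 9
```

Note that each doubling of the side length quadruples the pixel count, so the number of passes grows as log n while the work saved over sequential sampling grows as n².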

We evaluate the model on class-conditional image generation, text-to-image synthesis, and action-conditional video generation, showing that our model achieves the best results among non-pixel-autoregressive density models that allow efficient sampling.