AI/​video/​analysis directory


“VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Wang et al 2022

“VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners”, Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang et al (2022-05-22):

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction. Existing few-shot video-language learners focus exclusively on the encoder, resulting in the absence of a video-to-text decoder to handle generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and lack the ability to generate text for unseen tasks in a few-shot setting.

We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without the necessity of pretraining or finetuning on any video datasets.

We use the image-language models [CLIP + BLIP] to translate the video content into frame captions, object, attribute, and event phrases, and compose them into a temporal structure template. We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content. The flexibility of prompting allows the model to capture any form of text input, such as automatic speech recognition (ASR) transcripts.
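The composition step described above can be pictured as plain string templating. The following is a minimal sketch, not the authors' implementation: the dictionary keys (`objects`, `events`, `frame_captions`, `asr`), the "First/Then/Finally" wording, and the helper names are all assumptions made for illustration, and attribute phrases are omitted for brevity.

```python
def temporal_template(frame_captions):
    """Join per-frame captions in order, marking their temporal position."""
    last = len(frame_captions) - 1
    parts = []
    for i, caption in enumerate(frame_captions):
        marker = "First" if i == 0 else ("Finally" if i == last else "Then")
        parts.append(f"{marker}, {caption}.")
    return " ".join(parts)


def describe(video):
    """Render one video's extracted text (a plain dict here) as prompt lines."""
    lines = [
        "Objects: " + ", ".join(video["objects"]),
        "Events: " + ", ".join(video["events"]),
        "Frames: " + temporal_template(video["frame_captions"]),
    ]
    if video.get("asr"):  # optional extra text input, e.g. an ASR transcript
        lines.append("Subtitles: " + video["asr"])
    return "\n".join(lines)


def compose_prompt(instruction, examples, target):
    """Few-shot prompt: instruction, solved examples, then the unsolved target."""
    blocks = [instruction]
    for video, caption in examples:
        blocks.append(describe(video) + "\nCaption: " + caption)
    blocks.append(describe(target) + "\nCaption:")
    return "\n\n".join(blocks)


# Hypothetical descriptors, standing in for the output of image-language models.
example = {
    "objects": ["guitar", "stage"],
    "events": ["playing music"],
    "frame_captions": ["a man holds a guitar", "a man sings into a microphone"],
}
target = {
    "objects": ["pan", "egg"],
    "events": ["cooking"],
    "frame_captions": ["a hand cracks an egg", "an egg sizzles in a pan", "a plate of eggs"],
    "asr": "now we let it cook for a minute",
}
prompt = compose_prompt(
    "Write a one-sentence caption for the video.",
    [(example, "A man performs a song on stage.")],
    target,
)
print(prompt)
```

The resulting string would be sent to a language model, which completes the final `Caption:` line. Because everything is text, additional inputs such as the transcript only require another line in `describe`.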

Our experiments demonstrate the power of language models in understanding videos on a wide variety of video-language tasks, including video captioning, video question answering, video caption retrieval, and video future event prediction. In particular, on video future event prediction, our few-shot model substantially outperforms state-of-the-art supervised models trained on large-scale video datasets.

Code and resources are publicly available for research purposes at Github⁠.

“Imitating, Fast and Slow: Robust Learning from Demonstrations via Decision-time Planning”, Qi et al 2022

“Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning”, Carl Qi, Pieter Abbeel, Aditya Grover (2022-04-07):

The goal of imitation learning is to mimic expert behavior from demonstrations, without access to an explicit reward signal. A popular class of approaches infers the (unknown) reward function via inverse reinforcement learning (IRL) followed by maximizing this reward function via reinforcement learning (RL). The policies learned via these approaches are, however, very brittle in practice and deteriorate quickly even with small test-time perturbations due to compounding errors. We propose Imitation with Planning at Test-time (IMPLANT), a new meta-algorithm for imitation learning that utilizes decision-time planning to correct for compounding errors of any base imitation policy. In contrast to existing approaches, we retain both the imitation policy and the rewards model at decision-time, thereby benefiting from the learning signal of the two components. Empirically, we demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments and excels at zero-shot generalization when subject to challenging perturbations in test-time dynamics.

“Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Zeng et al 2022

“Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language”, Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo et al (2022-04-01):

Large foundation models can exhibit unique capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (eg. from spreadsheets, to SAT questions). As a result, these models store different forms of commonsense knowledge across different domains.

In this work, we show that this model diversity is symbiotic, and can be leveraged to build AI systems with structured Socratic dialogue—in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models, without additional finetuning.

In the context of egocentric perception, we present a case study of Socratic Models (SMs) that can provide meaningful results for complex tasks such as generating free-form answers to contextual questions about egocentric video, by formulating video Q&A as short story Q&A, i.e. summarizing the video into a short story, then answering questions about it. Additionally, SMs can generate captions for Internet images, and are competitive with state-of-the-art on zero-shot video-to-text retrieval with 42.8 R@1 on MSR-VTT 1k-A. SMs demonstrate how to compose foundation models zero-shot to capture new multimodal functionalities, without domain-specific data collection. Prototypes are available at Github⁠.

Figure 2: In this work we propose Socratic Models (SMs), a framework that uses structured dialogue between pre-existing foundation models, each of which can exhibit unique (but complementary) capabilities depending on the distributions of data on which they are trained. On various perceptual tasks (shown), this work presents a case study of SMs with visual language models (VLMs, eg. CLIP), large language models (LMs, eg. GPT-3, RoBERTa), and audio language models (ALMs, eg. Wav2CLIP, Speech2Text). From video search, to image captioning; from generating free-form answers to contextual reasoning questions, to forecasting future activities—SMs can provide meaningful results for complex tasks across classically challenging computer vision domains, without any model finetuning.

…Across a number of tasks spanning vision, language, and audio modalities, we find that specific instantiations of SMs, using LMs together with VLMs and audio-language models (ALMs), can generate results on challenging perceptual tasks (examples in Figure 2) that are often coherent and correct. We present results on Internet image captioning (Section 4) and the common video understanding task of video-to-text retrieval (Section 5), but our highlighted application is open-ended reasoning in the context of egocentric perception (Figure 4)—from answering free-form contextual reasoning questions about first-person videos (eg. “why did I go to the front porch today?”), to forecasting events into the future with commonsense (eg. “what will I do 3 hours from now?”). Our egocentric SM system consists of 2 primary components, each of which benefits from multimodal multi-model discussions: (1) assembling video into a language-based world-state history, i.e. a story or event log, then (2) performing various types of open-ended text-prompted tasks based on that world-state history. We find that simple scripted policies to guide a closed-loop exchange between pre-trained LM, VLM, and ALM models can (1) generate meaningful captions that respond to questions like “what am I doing?” with answers like “receiving a package” that span beyond the label set of standard vision datasets (Sigurdsson et al 2018; Smaira et al 2020), and (2) exhibit open-ended contextual Q&A capabilities previously thought to be out-of-reach for egocentric perception without domain-specific data collection (Grauman et al 2021; Damen et al 2020).

…In the context of egocentric perception, we find that formulating video Q&A as reading comprehension in SMs directly leverages the extent to which large LMs are capable of logical reasoning by connecting commonsense relationships with knowledge learned from Internet-scale data. For example, the system returns the following answer when presented with the world-state history log:

8:00 AM: went to grocery store to buy orange juice, chocolate, and bread.
8:15 AM: I went to gas station to fill up the vehicle tank.
8:30 AM: drove back home and left the groceries in the kitchen.
8:45 AM: started cooking eggs in the pan.
9:00 AM: the dog went into the kitchen.
9:15 AM: took the dog out for a walk.
9:30 AM: the dog is sick.
Q: Why is the dog sick? A: The dog may have eaten something it was not supposed to, such as chocolate.

Arriving at the answer requires bridging multiple connections between observations, eg. that the dog went into the kitchen, that the groceries are still in the kitchen, and that the groceries contain chocolate. Such results offer a glimpse of what might be possible using SMs for deductive reasoning across multiple domains of information, and raise interesting research questions on (1) how to better assemble language-based world-state histories (beyond what is presented in this work) that capture relevant evidence to improve the accuracy of conclusions, and (2) how to elicit chain of thought prompting (Wei et al 2022) to decompose multi-step problems into intermediate ones. For example, one promising extension could be prompting the LM with chain of thought sequences to expand on hypotheses:

Q: What are reasons for why I might be chopping wood?
A: Reasons might include: needing firewood, wanting to make a statement, or needing the exercise.

to which each hypothesis can be progressively explored by downstream subprograms called at recursively higher resolutions until a conclusion is reached. These directions suggest pathways towards achieving increasingly meaningful utility and analysis by digital multimodal assistants. [cf. Elicit/​Ought]
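Formulating video Q&A as reading comprehension, as in the world-state history log quoted earlier, comes down to serializing the log and appending the question. The sketch below illustrates only that formatting step; the function names are hypothetical, the history is abbreviated from the quoted log, and `lm` stands for any callable that maps a prompt string to a completion (a stub is used so the example runs offline).

```python
def build_qa_prompt(history, question):
    """Turn a timestamped event log into a reading-comprehension prompt."""
    log = "\n".join(f"{time}: {event}" for time, event in history)
    return f"{log}\nQ: {question} A:"


def answer(history, question, lm):
    """`lm` is any callable mapping a prompt string to a completion string."""
    return lm(build_qa_prompt(history, question)).strip()


# Abbreviated version of the log quoted in the text.
history = [
    ("8:00 AM", "went to grocery store to buy orange juice, chocolate, and bread."),
    ("9:00 AM", "the dog went into the kitchen."),
    ("9:30 AM", "the dog is sick."),
]


def echo_lm(prompt):
    """Stand-in for a real language model; it only reports the prompt size."""
    return f" (completion for a {len(prompt.splitlines())}-line prompt)"


print(build_qa_prompt(history, "Why is the dog sick?"))
print(answer(history, "Why is the dog sick?", echo_lm))
```

Swapping `echo_lm` for a real model call is the only change needed to get actual answers; the prompt layout itself is an assumption modeled on the example above.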

“Reinforcement Learning With Action-Free Pre-Training from Videos”, Seo et al 2022

“Reinforcement Learning with Action-Free Pre-Training from Videos”, Younggyo Seo, Kimin Lee, Stephen James, Pieter Abbeel (2022-03-25):

Recent unsupervised pre-training methods have been shown to be effective in the language and vision domains by learning useful representations for multiple downstream tasks. In this paper, we investigate whether such unsupervised pre-training methods can also be effective for vision-based reinforcement learning (RL).

To this end, we introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos. Our framework consists of two phases: we pre-train an action-free latent video prediction model, and then utilize the pre-trained representations for efficiently learning action-conditional world models on unseen environments. To incorporate additional action inputs during fine-tuning, we introduce a new architecture that stacks an action-conditional latent prediction model on top of the pre-trained action-free prediction model.

Moreover, for better exploration, we propose a video-based intrinsic bonus that leverages pre-trained representations. We demonstrate that our framework significantly improves both final performances and sample-efficiency of vision-based RL in a variety of manipulation and locomotion tasks. Code is available at Github⁠.

“CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Taesiri et al 2022

“CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning”, Mohammad Reza Taesiri, Finlay Macklon, Cor-Paul Bezemer (2022-03-21):

Gameplay videos contain rich information about how players interact with the game and how the game responds. Sharing gameplay videos on social media platforms, such as Reddit, has become a common practice for many players. Often, players will share gameplay videos that showcase video game bugs. Such gameplay videos are software artifacts that can be utilized for game testing, as they provide insight for bug analysis. Although large repositories of gameplay videos exist, parsing and mining them in an effective and structured fashion has still remained a big challenge.

In this paper, we propose a search method that accepts any English text query as input to retrieve relevant videos from large repositories of gameplay videos. Our approach does not rely on any external information (such as video metadata); it works solely based on the content of the video. By leveraging the zero-shot transfer capabilities of the Contrastive Language-Image Pre-Training (CLIP) model, our approach does not require any data labeling or training.
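Content-based retrieval of this kind can be sketched as a similarity search between one text embedding and per-frame image embeddings. The example below is an illustration under stated assumptions rather than the paper's pipeline: the two-dimensional vectors are toy stand-ins for CLIP embeddings, the video names are invented, and scoring a video by its best-matching frame is just one simple aggregation choice.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(query_embedding, frame_embeddings_by_video):
    """Rank videos by the best similarity between the query and any frame."""
    scores = {
        video: max(cosine(query_embedding, frame) for frame in frames)
        for video, frames in frame_embeddings_by_video.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings: in practice both sides would come from a CLIP model.
query = [1.0, 0.0]  # e.g. the embedded text "a car flying in the air"
videos = {
    "menu_screen": [[0.0, 1.0], [0.1, 0.9]],
    "car_flying": [[0.9, 0.1], [0.0, 1.0]],
}
print(rank_videos(query, videos))
```

No labels or training appear anywhere in the sketch, which mirrors the zero-shot claim: only embeddings and a similarity measure are needed.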

To evaluate our approach, we present the GamePhysics dataset consisting of 26,954 videos from 1,873 games, that were collected from the GamePhysics section on the Reddit website.

Our approach shows promising results in our extensive analysis of simple queries, compound queries, and bug queries, indicating that our approach is useful for object and event detection in gameplay videos. An example application of our approach is as a gameplay video search engine to aid in reproducing video game bugs.

Please visit the following link for the code and the data: CLIPxGamePhysics⁠.

“Robot Peels Banana With Goal-conditioned Dual-action Deep Imitation Learning”, Kim et al 2022

“Robot peels banana with goal-conditioned dual-action deep imitation learning”, Heecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi (2022-03-18):

A long-horizon dexterous robot manipulation task involving deformable objects, such as banana peeling, is difficult because such objects are hard to model and because little is known about stable, dexterous manipulation skills. This paper presents goal-conditioned dual-action deep imitation learning (DIL), which can learn dexterous manipulation skills from human demonstration data. Previous DIL methods map the current sensory input to a reactive action, which easily fails because of compounding errors in imitation learning caused by the recurrent computation of actions. The proposed method predicts a reactive action when precise manipulation of the target object is required (local action) and generates the entire trajectory when precise manipulation is not required (global action). This dual-action formulation effectively prevents compounding error with the trajectory-based global action while responding to unexpected changes in the target object with the reactive local action. Furthermore, in this formulation, both global and local actions are conditioned on a goal state, defined as the last step of each subtask, for robust policy prediction. The proposed method was tested on a real dual-arm robot and successfully accomplished the banana peeling task.

“Hierarchical Perceiver”, Carreira et al 2022

“Hierarchical Perceiver”, Joao Carreira, Skanda Koppula, Daniel Zoran, Adria Recasens, Catalin Ionescu, Olivier Henaff, Evan Shelhamer et al (2022-02-22):

General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by exclusively using global attention operations. This, however, hinders them from scaling up to the input sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). HiP retains the ability to process arbitrary modalities, but now at higher-resolution and without any specialized preprocessing, improving over flat Perceivers in both efficiency and accuracy on the ImageNet, AudioSet and PASCAL VOC datasets.

“MuZero With Self-competition for Rate Control in VP9 Video Compression”, Mandhane et al 2022

“MuZero with Self-competition for Rate Control in VP9 Video Compression”, Amol Mandhane, Anton Zhernov, Maribeth Rauh, Chenjie Gu, Miaosen Wang, Flora Xue, Wendy Shang, Derek Pang et al (2022-02-14):

Video streaming usage has seen a significant rise as entertainment, education, and business increasingly rely on online video. Optimizing video compression has the potential to increase access and quality of content to users, and reduce energy use and costs overall. In this paper, we present an application of the MuZero algorithm to the challenge of video compression. Specifically, we target the problem of learning a rate control policy to select the quantization parameters (QP) in the encoding process of libvpx, an open source VP9 video compression library widely used by popular video-on-demand (VOD) services.

We treat this as a sequential decision making problem to maximize the video quality with an episodic constraint imposed by the target bitrate. Notably, we introduce a novel self-competition based reward mechanism to solve constrained RL with variable constraint satisfaction difficulty, which is challenging for existing constrained RL methods. We demonstrate that the MuZero-based rate control achieves an average 6.28% reduction in size of the compressed videos for the same delivered video quality level (measured as PSNR BD-rate) compared to libvpx’s two-pass VBR rate control policy, while having better constraint satisfaction behavior.

“BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Li et al 2022

“BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi (2022-01-28):

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.

In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
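The caption bootstrapping idea (a captioner proposes synthetic captions, a filter discards noisy ones) can be sketched as a small data-cleaning loop. This is an illustrative sketch only: `captioner` and `is_match` are hypothetical callables standing in for trained models, the toy "images" are just sets of visible things, and applying the filter to both the web caption and the synthetic caption is an assumption of this example.

```python
def bootstrap_captions(pairs, captioner, is_match):
    """Keep only (image, caption) pairs that the filter accepts.

    pairs:     iterable of (image, web_caption)
    captioner: callable image -> synthetic caption
    is_match:  callable (image, caption) -> bool, the noise filter
    """
    cleaned = []
    for image, web_caption in pairs:
        for caption in (web_caption, captioner(image)):
            if is_match(image, caption):
                cleaned.append((image, caption))
    return cleaned


# Toy stand-ins so the sketch runs without any model.
def toy_captioner(image):
    return "a photo of " + " and ".join(sorted(image))


def toy_filter(image, caption):
    return any(thing in caption for thing in image)


pairs = [
    (frozenset({"dog"}), "best deals this weekend!!!"),       # noisy web text
    (frozenset({"cat", "sofa"}), "my cat asleep on the sofa"),  # usable web text
]
cleaned = bootstrap_captions(pairs, toy_captioner, toy_filter)
for image, caption in cleaned:
    print(sorted(image), "->", caption)
```

In the toy run the advertising text is dropped while the synthetic caption for the same image survives, so a noisy pair still contributes a usable training example.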

Code, models, and datasets are released at Github⁠.

“MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition”, Wu et al 2022

“MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition”, Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer et al (2022-01-20):

While today’s video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks.

In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache “memory” at each iteration. Through the memory, the model can reference prior context for long-term modeling, with only a marginal cost. Based on this idea, we build MeMViT⁠, a Memory-augmented Multiscale Vision Transformer⁠, that has a temporal support 30× longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models will be made publicly available.
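Online processing with a cached memory can be sketched as follows. This is a heavily simplified illustration, not the MeMViT architecture: the abstract does not say what exactly is cached, so the sketch caches raw token vectors, uses them directly as attention keys and values without learned projections, and bounds the cache with a hypothetical `memory_len` parameter.

```python
import math
from collections import deque


def attend(query, keys, values):
    """Single-query softmax attention over lists of vectors."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(logits)
    weights = [math.exp(logit - top) for logit in logits]  # numerically stable softmax
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def process_stream(clips, memory_len=2):
    """Process clips one at a time, letting each attend to cached earlier clips."""
    memory = deque(maxlen=memory_len)  # oldest clips fall out automatically
    outputs = []
    for clip in clips:  # clip: list of token vectors
        context = [token for past in memory for token in past] + clip
        outputs.append([attend(token, context, context) for token in clip])
        memory.append(clip)  # cache for later iterations
    return outputs


clips = [[[1.0, 0.0]], [[0.0, 1.0]]]
with_memory = process_stream(clips, memory_len=1)
without_memory = process_stream(clips, memory_len=0)
print(with_memory[1][0], without_memory[1][0])
```

Each step attends over at most `memory_len` cached clips plus the current one, so per-step cost stays bounded while information from earlier clips can still reach the current output.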

“CAST: Character Labeling in Animation Using Self-supervision by Tracking”, Nir et al 2022

“CAST: Character labeling in Animation using Self-supervision by Tracking”, Oron Nir, Gal Rapoport, Ariel Shamir (2022-01-19):

Cartoons and animation domain videos have very different characteristics compared to real-life images and videos. In addition, this domain carries a large variability in styles. Current computer vision and deep-learning solutions often fail on animated content because they were trained on natural images. In this paper we present a method to refine a semantic representation suitable for specific animated content.

We first train a neural network on a large-scale set of animation videos and use the mapping to deep features as an embedding space. Next, we use self-supervision to refine the representation for any specific animation style by gathering many examples of animated characters in this style, using multi-object tracking. These examples are used to define triplets for contrastive loss training. The refined semantic space allows better clustering of animated characters even when they have diverse manifestations.
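Triplet-based contrastive training can be illustrated with the standard triplet margin loss. The sketch below is an assumption-laden illustration rather than the paper's procedure: it supposes that two crops from the same track are an anchor and a positive, that a crop from a different track is a negative, and that the margin is 0.2; the abstract states none of these details.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on Euclidean distances."""
    return max(0.0, math.dist(anchor, positive) - math.dist(anchor, negative) + margin)


def triplets_from_tracks(tracks):
    """Build (anchor, positive, negative) triplets from tracked crops.

    tracks: dict mapping a track id to a list of embeddings of that character.
    Consecutive crops of one track form the anchor and positive; the first
    crop of every other track serves as a negative.
    """
    triplets = []
    for track_id, crops in tracks.items():
        for i in range(len(crops) - 1):
            for other_id, other_crops in tracks.items():
                if other_id != track_id:
                    triplets.append((crops[i], crops[i + 1], other_crops[0]))
    return triplets


# Toy 2-D embeddings for two tracked characters.
tracks = {
    "character_a": [[0.0, 0.0], [0.0, 1.0]],
    "character_b": [[3.0, 4.0]],
}
for anchor, positive, negative in triplets_from_tracks(tracks):
    print(triplet_loss(anchor, positive, negative))
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is what pulls crops of one character together and pushes other characters away.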

Using this space we can build dictionaries of characters in animation videos, and define specialized classifiers for specific stylistic content (eg. characters in a specific animation series) with very little user effort. These classifiers are the basis for automatically labeling characters in animation videos.

We present results on a collection of characters in a variety of animation styles.

“Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction”, Shi et al 2022

“Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction”, Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed (2022-01-05):

Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker’s lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at
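For readers unfamiliar with the metric, word error rate (WER) and the "relative reduction" quoted above can be computed as in the sketch below. The WER function uses the usual word-level edit distance divided by the reference length; the example sentences are invented, and only the 2.3% and 1.3% figures come from the abstract.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, 1):
        current = [i]
        for j, hyp_word in enumerate(hyp, 1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers `baseline`."""
    return (baseline - improved) / baseline


print(wer("the cat sat", "the bat sat"))
print(relative_reduction(2.3, 1.3))
```

Plugging in the audio-only numbers gives (2.3 − 1.3) / 2.3 ≈ 0.43, in line with the roughly 40% relative reduction the abstract reports.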

“MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Soldan et al 2021

“MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, Bernard Ghanem et al (2021-12-01):

The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases.

In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and focuses on crawling and aligning available audio descriptions of mainstream movies. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets. MAD’s collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately grounded in diverse long-form videos that can last up to three hours.

“MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”, Zhang et al 2021

“MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”, David Junhao Zhang, Kunchang Li, Yunpeng Chen, Yali Wang, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou et al (2021-11-24):

Self-attention has become an integral component of recent network architectures, eg. Transformer, that dominate major image and video benchmarks. This is because self-attention can flexibly model long-range information. For the same reason, researchers have recently attempted to revive the Multi-Layer Perceptron (MLP) and proposed a few MLP-Like architectures, showing great potential. However, current MLP-Like architectures are not good at capturing local details and lack a progressive understanding of core details in images and/or videos.

To overcome this issue, we propose a novel MorphMLP architecture that focuses on capturing local details at the low-level layers, while gradually changing to focus on long-term modeling at the high-level layers. Specifically, we design a Fully-Connected-Like layer, dubbed MorphFC, of two morphable filters that gradually grow their receptive fields along the height and width dimensions. More interestingly, we propose to flexibly adapt our MorphFC layer to the video domain. To the best of our knowledge, we are the first to create an MLP-Like backbone for learning video representations. Finally, we conduct extensive experiments on image classification, semantic segmentation and video classification. Our MorphMLP, a self-attention-free backbone, can be as powerful as, and even outperform, self-attention-based models.

“Florence: A New Foundation Model for Computer Vision”, Yuan et al 2021

“Florence: A New Foundation Model for Computer Vision”, Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang et al (2021-11-22):

Automated visual understanding of our diverse and open world demands computer vision models that generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications.

While existing vision foundation models such as CLIP, ALIGN⁠, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, [trained using UniCL] to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection⁠, VQA, image caption, video retrieval and action recognition.

Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general purpose vision tasks.

Florence achieves new state-of-the-art results in the majority of 44 representative benchmarks, eg. ImageNet-1K zero-shot classification with a top-1 accuracy of 83.74 and a top-5 accuracy of 97.18, 62.4 mAP on COCO fine tuning, 80.36 on VQA, and 87.8 on Kinetics-600.

“Scaling ASR Improves Zero and Few Shot Learning”, Xiao et al 2021

“Scaling ASR Improves Zero and Few Shot Learning”, Alex Xiao, Weiyi Zheng, Gil Keren, Duc Le, Frank Zhang, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf et al (2021-11-10):

With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets. To efficiently scale model sizes, we leverage various optimizations such as sparse transducer loss and model sharding. By training 1–10B parameter universal English ASR models, we push the limits of speech recognition performance across many domains. Furthermore, our models learn powerful speech representations with zero and few-shot capabilities on novel domains and styles of speech, exceeding previous results across multiple in-house and public benchmarks. For speakers with disorders due to brain damage, our best zero-shot and few-shot models achieve 22% and 60% relative improvement on the AphasiaBank test set, respectively, while realizing the best performance on public social media videos. Furthermore, the same universal model reaches equivalent performance with 500× less in-domain data on the SPGISpeech financial-domain dataset.

“ADOP: Approximate Differentiable One-Pixel Point Rendering”, Rückert et al 2021

“ADOP: Approximate Differentiable One-Pixel Point Rendering”, Darius Rückert, Linus Franke, Marc Stamminger (2021-10-13):

In this paper we present ADOP, a novel point-based, differentiable neural rendering pipeline. Like other neural renderers, our system takes as input calibrated camera images and a proxy geometry of the scene, in our case a point cloud. To generate a novel view, the point cloud is rasterized with learned feature vectors as colors and a deep neural network fills the remaining holes and shades each output pixel. The rasterizer renders points as one-pixel splats, which makes it very fast and allows us to compute gradients with respect to all relevant input parameters efficiently. Furthermore, our pipeline contains a fully differentiable physically-based photometric camera model, including exposure, white balance, and a camera response function. Following the idea of inverse rendering, we use our renderer to refine its input in order to reduce inconsistencies and optimize the quality of its output. In particular, we can optimize structural parameters like the camera pose, lens distortions, point positions and features, and a neural environment map, but also photometric parameters like camera response function, vignetting, and per-image exposure and white balance. Because our pipeline includes photometric parameters, eg. exposure and camera response function, our system can smoothly handle input images with varying exposure and white balance, and generates high-dynamic range output. We show that due to the improved input, we can achieve high render quality, also for difficult input, eg. with imperfect camera calibrations, inaccurate proxy geometry, or varying exposure. As a result, a simpler and thus faster deep neural network is sufficient for reconstruction. In combination with the fast point rasterization, ADOP achieves real-time rendering rates even for models with well over 100M points.
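The forward pass of a one-pixel point rasterizer can be sketched in a few lines. This is a minimal sketch under simplifying assumptions, not ADOP itself: it uses an ideal pinhole camera with a single hypothetical focal length and principal point (`focal`, `cx`, `cy`), rounds each projected point to its nearest pixel, resolves conflicts with a depth buffer, and leaves out the gradient approximation, the photometric model, and the neural network that fills holes.

```python
def rasterize_points(points, features, width, height, focal, cx, cy):
    """Splat each 3-D point onto exactly one pixel, keeping the nearest point.

    points:   list of (x, y, z) in camera coordinates, z pointing forward
    features: one feature per point (any value; a learned vector in practice)
    Returns a height x width grid whose empty pixels (holes) are None.
    """
    depth = [[float("inf")] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    for (x, y, z), feature in zip(points, features):
        if z <= 0:  # behind the camera
            continue
        u = int(round(focal * x / z + cx))  # pinhole projection, nearest pixel
        v = int(round(focal * y / z + cy))
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            image[v][u] = feature
    return image


points = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (10.0, 0.0, 1.0)]
features = ["far", "near", "off_screen"]
image = rasterize_points(points, features, width=4, height=4, focal=1.0, cx=2.0, cy=2.0)
for row in image:
    print(row)
```

Because each point touches a single pixel, the cost is linear in the number of points, and most pixels stay empty in sparse regions; those holes are what the abstract's neural network is said to fill.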

“VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding”, Xu et al 2021

“VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding”⁠, Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer et al (2021-09-28; ; similar):

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.

VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval.

Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches.

Code is made available at GitHub.

“Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021

“Perceiver IO: A General Architecture for Structured Inputs & Outputs”⁠, Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula et al (2021-07-30; ; backlinks; similar):

[code⁠; Hugging Face] The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original’s appealing properties by learning to flexibly query the model’s latent space to produce outputs of arbitrary size and semantics.

Perceiver IO still decouples model depth from data size and still scales linearly with data size, but now with respect to both input and output sizes.

The full Perceiver IO model achieves strong results on tasks with highly structured output spaces, such as natural language and visual understanding, StarCraft II⁠, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT baseline on the GLUE language benchmark without the need for input tokenization, and achieves state-of-the-art performance on Sintel optical flow estimation.

Figure 2: The Perceiver IO architecture. Perceiver IO maps arbitrary input arrays to arbitrary output arrays in a domain agnostic process. The bulk of the computation happens in a latent space whose size is typically smaller than the inputs and outputs, which makes the process computationally tractable even for very large inputs & outputs.

…The Perceiver IO architecture relies on the same primitives as Transformers: so why aren’t Transformers all you need? The answer is that Transformers scale very poorly in both compute and memory [82]. A Transformer deploys attention modules homogeneously throughout its architecture, using its full input to generate queries and keys at every layer. As discussed in [35], this means each layer scales quadratically in compute and memory, which currently makes it impossible to apply Transformers on high-dimensional data like images without some form of preprocessing. Even on domains like language where Transformers shine, preprocessing (eg. tokenization) is often needed to scale beyond short input sequences. On the other hand, Perceiver IO uses attention non-homogeneously, first using it to map inputs to a latent space, then using it to process in that latent space, and finally using it to map to an output space. The resulting architecture has no quadratic dependence on the input or output size: encoder and decoder attention modules depend linearly on the input and output size (respectively), while latent attention is independent of both input and output sizes. Because of this structure, and the corresponding reduction in compute and memory requirements, Perceivers scale to much larger inputs and outputs. While Transformers are typically used in settings with inputs and outputs of at most a few thousand dimensions [9, 63], we show good results on domains with hundreds of thousands of input and output dimensions.

…Because of this structure, this architecture can be applied to inputs of any shape or spatial layout and even to inputs or outputs which don’t share the same spatial structure (eg. sound and video). However, in contrast to the latent spaces used elsewhere in vision (eg. [67]), the latent does not explicitly share the structure (spatial or otherwise) of the inputs. To decode this information, we query for it using cross-attention.
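The encode, process, decode pattern described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the paper's implementation: attention here is single-head with no learned projections (keys double as values), and the names `attend`, `random_array`, and all sizes are hypothetical. The point is that the input array is touched only once by the encoder, and the output size is set entirely by the number of output queries.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head dot-product attention; keys double as values (no learned weights)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)])
    return out


def random_array(rows, dim, rng):
    """Stand-in for data or learned parameters (hypothetical helper)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
DIM = 8
inputs = random_array(200, DIM, rng)         # M = 200 input elements
latents = random_array(16, DIM, rng)         # N = 16 latents (learned in a real model)
output_queries = random_array(50, DIM, rng)  # O = 50 output queries

latents = attend(latents, inputs)            # encode: score matrix is N x M
for _ in range(2):
    latents = attend(latents, latents)       # process: score matrix is N x N
outputs = attend(output_queries, latents)    # decode: score matrix is O x N

print(len(outputs), len(outputs[0]))
```

No step builds an M x M or O x O score matrix, which is the sense in which the cost is linear in input and output size for a fixed latent size.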

4.4 StarCraft II: To further demonstrate Perceiver IO’s capabilities on discrete modalities and to serve as a drop-in replacement for Transformers, we use Perceiver IO to replace the Transformer in AlphaStar, the state-of-the-art system for the complex game of StarCraft II. At its core, AlphaStar [89] represents the units in the game as a discrete, unordered set of symbols (the “units”). These units are represented by a vector of properties such as unit type, position, health, etc. At each timestep, the architecture encodes up to 512 unit “tokens” with a vanilla Transformer. This representation is used both as a summary of the state (after pooling) and as a rich representation of the 512 units. This representation is used by a pointer network [90] to assign a probability to each possible unit selection, effectively parameterizing the agent’s unit selection policy (see [89] and Appendix Section G for more details). We replaced the Transformer that inputs and outputs 512 units with Perceiver IO with a latent size of 32. Without tuning any additional parameters, we observed that the resulting agent reached the same level of performance as the original AlphaStar agent, reaching an 87% win-rate versus the Elite bot after behavioral cloning [61] on human data.

“CLIP-It! Language-Guided Video Summarization”, Narasimhan et al 2021

“CLIP-It! Language-Guided Video Summarization”⁠, Medhini Narasimhan, Anna Rohrbach, Trevor Darrell (2021-07-01; ; similar):

A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often subjective, and users should have the option of customizing the summary by using natural language to specify what is important to them. Further, existing models for fully automatic generic summarization have not exploited available language models, which can serve as an effective prior for saliency. This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization, typically approached separately in the literature. We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another and their correlation with a user-defined query (for query-focused summarization) or an automatically generated dense video caption (for generic video summarization). Our model can be extended to the unsupervised setting by training without ground-truth supervision. We outperform baselines and prior work by a significant margin on both standard video summarization datasets (TVSum and SumMe) and a query-focused video summarization dataset (QFVS). Particularly, we achieve large improvements in the transfer setting, attesting to our method’s strong generalization capabilities.

“CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Fang et al 2021

“CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”⁠, Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen (2021-06-21; ⁠, ; similar):

We present CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in the domain of video-and-language learning try to distill the spatio-temporal video features and multi-modal interaction between videos and languages from a large-scale video-text dataset. Different from them, we leverage pretrained image-language model, simplify it as a two-stage framework with co-learning of image-text and enhancing temporal relations between video frames and video-text respectively, make it able to train on comparatively small datasets. Specifically, based on the spatial semantics captured by Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motions at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies, and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT⁠, MSVD and VATEX.

“Revisiting ResNets: Improved Training and Scaling Strategies”, Bello et al 2021

“Revisiting ResNets: Improved Training and Scaling Strategies”⁠, Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens et al (2021-03-13; ; backlinks; similar):

Novel computer vision architectures monopolize the spotlight, but the impact of the model architecture is often conflated with simultaneous changes to training methodology and scaling strategies. Our work revisits the canonical ResNet (He et al 2015) and studies these three aspects in an effort to disentangle them. Perhaps surprisingly, we find that training and scaling strategies may matter more than architectural changes, and further, that the resulting ResNets match recent state-of-the-art models. We show that the best performing scaling strategy depends on the training regime and offer two new scaling strategies: (1) scale model depth in regimes where overfitting can occur (width scaling is preferable otherwise); (2) increase image resolution more slowly than previously recommended (Tan & Le, 2019). Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7×—2.7× faster than EfficientNets on TPUs⁠, while achieving similar accuracies on ImageNet. In a large-scale semi-supervised learning setup, ResNet-RS achieves 86.2% top-1 ImageNet accuracy, while being 4.7× faster than EfficientNet NoisyStudent. The training techniques improve transfer performance on a suite of downstream tasks (rivaling state-of-the-art self-supervised algorithms) and extend to video classification on Kinetics-400. We recommend practitioners use these simple revised ResNets as baselines for future research.

“Learning from Videos to Understand the World”, Zweig et al 2021

“Learning from videos to understand the world”⁠, Geoffrey Zweig, Polina Kuznetsova, Michael Auli, Francois Fagan (2021-03-12; ⁠, ; similar):

  • Today, we’re announcing a project called “Learning from Videos”, designed to automatically learn audio, textual, and visual representations from the data in publicly available videos uploaded to Facebook.
  • By learning from videos spanning nearly every country and hundreds of languages, this project will not just help us continuously improve our core AI systems for applications like content recommendation and policy enforcement—it will enable entirely new experiences.
  • This is also part of our broader efforts toward building machines that learn like humans do—from any example, not just ones where experts have labeled.
  • The first application is now live in Instagram Reels’ [TikTok-style 15s-long videos] recommendation system.

…Although we’ve just scratched the surface, using semi-supervised and self-supervised learning on the videos uploaded to Facebook has already improved our computer vision and speech recognition systems. Within six months of developing Generalized Data Transformations (GDT), a state-of-the-art, self-supervised framework for video understanding, we’ve built and deployed an AI model in Instagram Reels’ recommendation system. And this is just the beginning of our Learning from Videos project. Early experiments in applying self-supervised learning to real-world videos also show a 20% reduction in speech recognition errors, which could improve a wide range of applications like auto-captioning and tasks that help flag harmful content like hate speech. And we’re researching ways to apply new capabilities, like multimodal video retrieval, in order to make it easier for people to surface key moments in time from their trove of digital memories.

Improving Reels recommendations with self-supervision: Finding similar Reels fits particularly well with self-supervised models because Reels tend to be highly stylized, featuring common patterns across trendy videos. Popular videos often consist of the same music set to the same dance moves, but created and acted by different people. Self-supervised models automatically learn “themes”, group them together, and implicitly make them available to the recommendation system. We’re using self-supervision to suggest videos that are relevant to recently watched videos, while filtering out near-duplicates—without explicit training labels for each classification task. To achieve this, we leveraged Generalized Data Transformations (GDT), our state-of-the-art method for building video embeddings, which systematically learns the relationships between the sound and images in a video. Since building this technology last year, we’ve pioneered the large-scale application of GDT to the representation of Reels data, by training a series of models on a data set of millions of Reels and videos from Instagram…We ran the model in production and made its output available in real time to the ranking system. Using this approach, we were able to run online A/​B tests that showed positive results.

Better speech recognition for more languages and domains: Recently, speech models have been able to successfully learn the entire structure of language using mostly raw speech data—and to improve on traditional, supervised methods. Our latest technique for learning speech representations, called wav2vec 2.0, works by first masking a portion of the speech and then learning to predict masked speech units. To provide an idea of the speed of progress, wav2vec 2.0 and self-training require only 10 minutes of transcribed audio to achieve very good speech recognition results on the LibriSpeech industry benchmark. The same results required nearly 1,000 hours of transcribed audio just one year ago.

…To test the method on real-world data, we applied wav2vec 2.0 on millions of hours of unlabeled videos and just 100 hours of labeled data. We achieved strong improvements of about 20% relative word error reduction, compared with supervised-only baselines with the 100 hours. This proves, for the first time, that self-supervised learning with wav2vec 2.0 is effective for real-world data sets that are not as curated as the LibriSpeech corpus used in the original paper. The video data we trained wav2vec on is largely varied, and we found that wav2vec performs particularly well for subdomains and accents where little labeled data exists.

As a next step, we’re now working on scaling wav2vec 2.0 to more data and more languages. These models will reduce labeling for new automatic speech recognition domains (eg. AR glasses and virtual gaming), improve the performance of low-resource and medium-resource models, and improve other speech and audio tasks. As part of these efforts, we’re currently working on training a multilingual model with millions of hours of speech from 25 languages.

Jointly learning video, audio, text to recall digital memories: …Recent self-supervised learning advances have made it possible to create a joint representation of audio, visual, and textual signals in a single vector space. As part of our latest research efforts, we are using the combination of Facebook videos and their associated text (title, caption, descriptions) as the key lever for multimodal understanding…We’ve previously achieved this for images rather than videos using billions of public images and thousands of hashtags…In this research model, we extract a visual clip—which is a short sequence of visual frames—from a video every second. Our system analyzes this sequence using a convolutional neural network (CNN) to produce a vector of numbers that represents the information in the clip. This information is aggregated across time, both with another CNN and with an attention model. The output of this process is an overall representation of the information in the visual part of the video. We follow a similar process with audio…As a next step, we’re now working on scaling this feature up to millions of videos before we can start testing the feature in production.

…Our Learning from Videos project signals a paradigm shift in the way machines are able to understand videos, sending us on the path to build smarter AI systems. This work will allow us to move away from AI that requires people to look at and label videos by hand, and will make it possible for us to build AI systems that use the most advanced techniques, such as self-supervision, to improve recommendations, search, and retrieval, and other important applications for everyone on Facebook. As our systems continuously learn, they will become more reliable, efficient, and personalized, so that sharing and rediscovering moments can one day be effortless. We are excited to continue our research in the space as we share more of our findings and work to productionize cutting-edge AI research that improves our core technology systems, unlocking new experiences for the billions of people around the world who use our products and services every day.

“Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021

“Perceiver: General Perception with Iterative Attention”⁠, Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira (2021-03-04; backlinks; similar):

[video⁠, code (including Perceiver IO); Hierarchical Perceiver⁠; cf. Universal Transformers] Biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities.

In this paper we introduce the Perceiver—a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like Convnets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs.

We show that this architecture performs competitively or beyond strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video and video+audio. The Perceiver obtains performance comparable to ResNet-50 on ImageNet without convolutions and by directly attending to 50,000 pixels. It also surpasses state-of-the-art results for all modalities in AudioSet. [See also “Set Transformer” for efficient attention⁠.]

Figure 1: The Perceiver is an architecture based on attentional principles that scales to high-dimensional inputs such as images, videos, audio, point-clouds (and multimodal combinations) without making any domain-specific assumptions. The Perceiver uses a cross-attention module to project an input high-dimensional byte array to a fixed-dimensional latent bottleneck (M ≫ N) before processing it using a stack of transformers in the low-d latent space. The Perceiver iteratively attends to the input byte array by alternating cross-attention and latent transformer blocks.
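A rough back-of-the-envelope count shows why the latent bottleneck matters at the 50,000-pixel scale mentioned in the abstract. The sketch below only counts entries in the attention score matrix (queries × keys). The latent count of 512 is an assumption chosen for illustration, not a figure taken from the text above, and `attention_score_entries` is a hypothetical helper.

```python
def attention_score_entries(num_queries, num_keys):
    """Number of query-key scores one attention layer must compute."""
    return num_queries * num_keys


M = 50_000  # input elements (pixels), from the abstract
N = 512     # latent vectors: an illustrative assumption

full_self_attention = attention_score_entries(M, M)    # Transformer over raw pixels
cross_attention = attention_score_entries(N, M)        # latents attend to the input
latent_self_attention = attention_score_entries(N, N)  # processing in latent space

print(full_self_attention)    # 2500000000
print(cross_attention)        # 25600000
print(latent_self_attention)  # 262144
```

Under these assumed sizes, a single cross-attention costs about 1% of full self-attention over the pixels, and every subsequent latent layer is cheaper still.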

“Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning”, Lee et al 2021

“Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning”⁠, Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song (2021-01-26; ; backlinks; similar):

Large-scale datasets are the cornerstone of self-supervised representation learning. Existing algorithms extract learning signals by making certain assumptions about the data, eg. spatio-temporal continuity and multimodal correspondence. Unfortunately, finding a large amount of data that satisfies such assumptions is sometimes not straightforward. This restricts the community to rely on datasets that require laborious annotation and/​or manual filtering processes. In this paper, we describe a subset optimization approach for automatic dataset curation. Focusing on the scenario of audio-visual representation learning, we pose the problem as finding a subset that maximizes the mutual information between audio and visual channels in videos. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales. The most important benefit of our approach is scalability. We release the largest video dataset for audio-visual research collected automatically using our approach.

Figure 4: Linear evaluation on downstream tasks. The top-1/​5 accuracy (%) of video classification on UCF101 [66], audio classification on ESC-50 [58] and audio-visual classification on Kinetics-Sounds (KS) [4]. We group the results by the downstream tasks and by the scale of the pretrain datasets. Baselines are Kinetics-Sounds [4] (20K), VGG-Sound [11] (200K), and AudioSet [19] (2M).

“CLIP: Connecting Text and Images: We’re Introducing a Neural Network Called CLIP Which Efficiently Learns Visual Concepts from Natural Language Supervision. CLIP Can Be Applied to Any Visual Classification Benchmark by Simply Providing the Names of the Visual Categories to Be Recognized, Similar to the “zero-shot” Capabilities of GPT-2 and GPT-3”, Radford et al 2021

“CLIP: Connecting Text and Images: We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3”⁠, Alec Radford, Ilya Sutskever, Jong Wook Kim, Gretchen Krueger, Sandhini Agarwal (2021-01-05; ⁠, ⁠, ⁠, ⁠, ; backlinks; similar):

[CLIP paper] We present a neural network that aims to address these problems: it is trained on a wide variety of images with a wide variety of natural language supervision that’s abundantly available on the internet. By design, the network can be instructed in natural language to perform a great variety of classification benchmarks, without directly optimizing for the benchmark’s performance, similar to the “zero-shot” capabilities of GPT-2 and GPT-3. This is a key change: by not directly optimizing for the benchmark, we show that it becomes much more representative: our system closes this “robustness gap” by up to 75% while matching the performance of the original ResNet50 on ImageNet zero-shot without using any of the original 1.28M labeled examples.

Approach: We show that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a great variety of image classification datasets. Our method uses an abundantly available source of supervision: the text paired with images found across the internet. This data is used to create the following proxy training task for CLIP: given an image, predict which out of a set of 32,768 randomly sampled text snippets, was actually paired with it in our dataset.

In order to solve this task, our intuition is that CLIP models will need to learn to recognize a wide variety of visual concepts in images and associate them with their names. As a result, CLIP models can then be applied to nearly arbitrary visual classification tasks. For instance, if the task of a dataset is classifying photos of dogs vs cats we check for each image whether a CLIP model predicts the text description “a photo of a dog” or “a photo of a cat” is more likely to be paired with it.
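The dog-vs-cat check can be illustrated with a small sketch. The embeddings below are hand-written toy vectors standing in for the outputs of CLIP's image and text encoders, which are not reproduced here. The function name `zero_shot_classify` and the temperature value of 0.07 are likewise assumptions made for this example.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.07):
    """Softmax over image-to-prompt similarities; returns {prompt: probability}."""
    prompts = list(prompt_embeddings)
    logits = [cosine(image_embedding, prompt_embeddings[p]) / temperature for p in prompts]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return {p: e / total for p, e in zip(prompts, exps)}


# Toy stand-ins for encoder outputs (hypothetical values, not real CLIP embeddings).
prompt_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.2],
    "a photo of a cat": [0.1, 0.9, 0.2],
}
image_embedding = [0.8, 0.3, 0.1]

probs = zero_shot_classify(image_embedding, prompt_embeddings)
prediction = max(probs, key=probs.get)
print(prediction)  # a photo of a dog
```

Changing the classifier to a new task only means changing the prompt strings, which is why no task-specific training data is needed.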

  1. CLIP is highly efficient…In the end, our best performing CLIP model trains on 256 GPUs for 2 weeks which is similar to existing large scale image models.
  2. CLIP is flexible and general: Because they learn a wide range of visual concepts directly from natural language, CLIP models are substantially more flexible and general than existing ImageNet models. We find they are able to zero-shot perform many different tasks. To validate this we have measured CLIP’s zero-shot performance on over 30 different datasets including tasks such as fine-grained object classification, geo-localization, action recognition in videos, and OCR. [While CLIP’s zero-shot OCR performance is mixed, its semantic OCR representation is quite useful. When evaluated on the SST-2 NLP dataset rendered as images, a linear classifier on CLIP’s representation matches a CBoW model with direct access to the text. CLIP is also competitive at detecting hateful memes without needing ground truth text.] In particular, learning OCR is an example of an exciting behavior that does not occur in standard ImageNet models.

…CLIP allows people to design their own classifiers and removes the need for task-specific training data. [See also “AudioCLIP: Extending CLIP to Image, Text and Audio”⁠, Guzhov et al 2021; CLIP notebook compilation for art⁠, “Alien Dreams: An Emerging Art Scene”⁠/​“AI Generated Art Scene Explodes as Hackers Create Groundbreaking New Tools”⁠.]

“Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021

“Learning Transferable Visual Models From Natural Language Supervision”⁠, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry et al (2021-01-05; ; backlinks; similar):

[Blog] State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision.

We demonstrate that the simple pre-training [contrastive learning] task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification.
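A minimal sketch of the "which caption goes with which image" objective: within a batch, each image should score highest against its own caption and vice versa. The symmetric cross-entropy form, the temperature value, and the name `clip_style_loss` are assumptions for illustration, and the embeddings are toy vectors rather than encoder outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_total = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_total for x in row]


def clip_style_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy over a batch."""
    images = [normalize(v) for v in image_embeddings]
    texts = [normalize(v) for v in text_embeddings]
    n = len(images)
    # logits[i][j]: scaled cosine similarity of image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in texts] for i in images]
    # Row i should put its mass on column i (the paired caption).
    loss_images = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Column j should put its mass on row j (the paired image).
    columns = [list(col) for col in zip(*logits)]
    loss_texts = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_images + loss_texts) / 2


# Toy batch of 3 pairs (hypothetical values).
batch_images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
batch_texts = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
shuffled_texts = batch_texts[1:] + batch_texts[:1]  # deliberately wrong pairing

matched_loss = clip_style_loss(batch_images, batch_texts)
mismatched_loss = clip_style_loss(batch_images, shuffled_texts)
print(matched_loss < mismatched_loss)  # True
```

The loss is low when paired items are the most similar in the batch and high when the pairing is scrambled, which is the signal that drives the encoders during pre-training.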

The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.

Figure 4: Prompt engineering and ensembling improve zero-shot performance. Compared to the baseline of using contextless class names, prompt engineering and ensembling boost zero-shot classification performance by almost 5 points on average across 36 datasets. This improvement is similar to the gain from using 4× more compute with the baseline zero-shot method but is “free” when amortized over many predictions.
Figure 5: Zero-shot CLIP is competitive with a fully supervised baseline. Across a 27 dataset eval suite, a zero-shot CLIP classifier outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 datasets, including ImageNet.
Figure 9: Zero-shot CLIP performance scales smoothly as a function of model compute. Across 39 evals on 36 different datasets, average zero-shot error is well modeled by a log-log linear trend across a 44× range of compute spanning 5 different CLIP models. Lightly shaded lines are performance on individual evals, showing that performance is much more varied despite the smooth overall trend.
Figure 13: Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models. (Left) An ideal robust model (dashed line) performs equally well on the ImageNet distribution and on other natural image distributions. Zero-shot CLIP models shrink this “robustness gap” by up to 75%. Linear fits on logit transformed values are shown with bootstrap estimated 95% confidence intervals. (Right) Visualizing distribution shift for bananas, a class shared across 5 of the 7 natural distribution shift datasets. The performance of the best zero-shot CLIP model, ViT-L/​14@336px, is compared with a model that has the same performance on the ImageNet validation set, ResNet-101.
Figure 21: Visualization of predictions from 36 CLIP zero-shot classifiers. All examples are random with the exception of reselecting Hateful Memes to avoid offensive content. The predicted probability of the top 5 classes is shown along with the text used to represent the class. When more than one template is used, the first template is shown. The ground truth label is colored green while an incorrect prediction is colored orange.

[Evaluations: Food101 · SUN397 · Youtube-BB · EuroSAT · PatchCamelyon (PCam) · ImageNet-A (Adversarial) · CIFAR-10 · CLEVR Count · Facial Emotion Recognition 2013 (FER2013) · UCF101 · Caltech-101 · ImageNet-R (Rendition) · Oxford-IIIT Pets · CIFAR-100 · ImageNet-V2 Matched Frequency · FGVC Aircraft · Country211 · RESISC45 · Stanford Cars · SUN · Kinetics-700 · Flowers-102 · ImageNet · Birdsnap · aYahoo · ObjectNet ImageNet Overlap · ImageNet Blurry · Describable Textures Dataset (DTD) · PASCAL VOC 2007 · MNIST · Street View House Numbers (SVHN) · ImageNet Vid · ImageNet Sketch · Hateful Memes · Stanford Sentiment Treebank · German Traffic Sign Recognition Benchmark (GTSRB)]

“Transformers in Vision: A Survey”, Khan et al 2021

“Transformers in Vision: A Survey”⁠, Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah (2021-01-04; similar):

Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks eg. Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (eg. images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (eg. image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (eg. visual-question answering, visual reasoning, and visual grounding), video processing (eg. activity recognition, video forecasting), low-level vision (eg. image super-resolution, image enhancement, and colorization) and 3D analysis (eg. point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works.

“Object-based Attention for Spatio-temporal Reasoning: Outperforming Neuro-symbolic Models With Flexible Distributed Architectures”, Ding et al 2020

“Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures”, David Ding, Felix Hill, Adam Santoro, Matt Botvinick (2020-12-15):

Neural networks have achieved success in a wide array of perceptual tasks, but it is often stated that they are incapable of solving tasks that require higher-level reasoning. Two new task domains, CLEVRER and CATER⁠, have recently been developed to focus on reasoning, as opposed to perception, in the context of spatio-temporal interactions between objects. Initial experiments on these domains found that neuro-symbolic approaches, which couple a logic engine and language parser with a neural perceptual front-end, substantially outperform fully-learned distributed networks, a finding that was taken to support the above thesis.

Here, we show on the contrary that a fully-learned neural network with the right inductive biases can perform substantially better than all previous neural-symbolic models on both of these tasks, particularly on questions that most emphasize reasoning over perception. Our model makes critical use of both self-attention and learned “soft” object-centric representations, as well as BERT-style semi-supervised predictive losses. These flexible biases allow our model to surpass the previous neuro-symbolic state-of-the-art using less than 60% of available labeled data.

Together, these results refute the neuro-symbolic thesis laid out by previous work involving these datasets, and they provide evidence that neural networks can indeed learn to reason effectively about the causal, dynamic structure of physical events.

“Accuracy and Performance Comparison of Video Action Recognition Approaches”, Hutchinson et al 2020

“Accuracy and Performance Comparison of Video Action Recognition Approaches”, Matthew Hutchinson, Siddharth Samsi, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Micheal Houle et al (2020-08-20):

Over the past few years, there has been substantial interest in video action recognition systems and models. However, direct comparison of accuracy and computational performance results remains clouded by differing training environments, hardware specifications, hyperparameters, pipelines, and inference methods. This article provides a direct comparison between fourteen off-the-shelf and state-of-the-art models by ensuring consistency in these training characteristics in order to provide readers with a meaningful comparison across different types of video action recognition algorithms. Accuracy of the models is evaluated using standard Top-1 and Top-5 accuracy metrics in addition to a proposed new accuracy metric. Additionally, we compare computational performance of distributed training from two to sixty-four GPUs on a state-of-the-art HPC system.

[Keywords: action recognition, neural network, deep learning, accuracy metrics, computational performance]
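The standard Top-1 and Top-5 metrics named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `top_k_accuracy` and the toy scores are assumptions, and the paper's proposed new metric is not reproduced because the abstract does not define it.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this clip.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Hypothetical class scores for 3 clips over 6 action classes.
scores = [
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],  # true class 1 has the highest score
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # true class 2 has the second-highest score
    [0.40, 0.30, 0.15, 0.10, 0.04, 0.01],  # true class 5 has the lowest score
]
labels = [1, 2, 5]

top1 = top_k_accuracy(scores, labels, k=1)  # only the first clip counts
top5 = top_k_accuracy(scores, labels, k=5)  # the first two clips count
```

Top-5 is always at least as high as Top-1, since a Top-1 hit is also a Top-5 hit.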

[Jack Clark’s summary:

Which is the best system for video action recognition? Simple 2D convnets, says survey:

…Richard Sutton’s ‘bitter lesson’ strikes again…

Researchers with MIT have analyzed the performance of fourteen different models used for video action recognition (correctly labeling something in a video, a generically useful AI capability). The results show that simple techniques tend to beat complex ones. Specifically, the researchers benchmark a range of 2D convolutional networks (C2Ds) against temporal segment networks (TSNs), Long-Term Recurrent Convolutional Networks (LRCNs) and Temporal Shift Modules (TSMs). They find that the simple stuff, 2D convnets, performs best.

The bitter lesson results: Convolutional net models “significantly outperform” the other models they test. Specifically, the Inception-ResNet-v2, ResNet50, DenseNet201, and MobileNetv2 are all top performers. These results also highlight some of the ideas in Sutton’s ‘bitter lesson’ essay—namely that simpler things that scale better tend to beat the smart stuff. “2D approaches can yield results comparable to their more complex 3D counterparts, and model depth, rather than input feature scale, is the critical component to an architecture’s ability to extract a video’s semantic action information”, they write.]

“Self-supervised Learning through the Eyes of a Child”, Orhan et al 2020

“Self-supervised learning through the eyes of a child”, A. Emin Orhan, Vaibhav V. Gupta, Brenden M. Lake (2020-07-31):

Within months of birth, children develop meaningful expectations about the world around them. How much of this early knowledge can be explained through generic learning mechanisms applied to sensory data, and how much of it requires more substantive innate inductive biases? Addressing this fundamental question in its full generality is currently infeasible, but we can hope to make real progress in more narrowly defined domains, such as the development of high-level visual categories, thanks to improvements in data collecting technology and recent progress in deep learning. In this paper, our goal is precisely to achieve such progress by utilizing modern self-supervised deep learning methods and a recent longitudinal, egocentric video dataset recorded from the perspective of three young children (Sullivan et al 2020). Our results demonstrate the emergence of powerful, high-level visual representations from developmentally realistic natural videos using generic self-supervised learning objectives.

“Gesticulator: A Framework for Semantically-aware Speech-driven Gesture Generation”, Kucherenko et al 2020

“Gesticulator: A framework for semantically-aware speech-driven gesture generation”, Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite et al (2020-01-25):

During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (eg. raising a hand when saying “high”): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page.

“SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded from the Infant’s Perspective”, Sullivan et al 2020

“SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective”, Jess Sullivan, Michelle Mei, Amy Perfors, Erica Wojcik, Michael Frank (2020-01-14):

We introduce a new resource: the SAYCam corpus. Infants aged 6–32 months wore a head-mounted camera for ~2 hours per week, over the course of ~two and a half years. The result is a large, naturalistic, longitudinal dataset of infant-perspective and child-perspective videos. Transcription efforts are underway, with over 200,000 words of naturalistic dialogue already transcribed. Similarly, the dataset is searchable using a number of criteria (eg. age of participant, location, setting, objects present). The resulting dataset will be of broad use to psychologists, linguists, and computer scientists.

“Axial Attention in Multidimensional Transformers”, Ho et al 2019

“Axial Attention in Multidimensional Transformers”, Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, Tim Salimans (2019-12-20):

We propose Axial Transformers, a self-attention-based autoregressive model for images and other data organized as high dimensional tensors. Existing autoregressive models either suffer from excessively large computational resource requirements for high dimensional data, or make compromises in terms of distribution expressiveness or ease of implementation in order to decrease resource requirements. Our architecture, by contrast, maintains both full expressiveness over joint distributions over data and ease of implementation with standard deep learning frameworks, while requiring reasonable memory and computation and achieving state-of-the-art results on standard generative modeling benchmarks.

Our models are based on axial attention, a simple generalization of self-attention that naturally aligns with the multiple dimensions of the tensors in both the encoding and the decoding settings. Notably the proposed structure of the layers allows for the vast majority of the context to be computed in parallel during decoding without introducing any independence assumptions. This semi-parallel structure goes a long way to making decoding from even a very large Axial Transformer broadly applicable.

We demonstrate state-of-the-art results for the Axial Transformer on the ImageNet-32 and ImageNet-64 image benchmarks as well as on the BAIR Robotic Pushing video benchmark. We open source the implementation of Axial Transformers.
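The core idea of axial attention, applying self-attention along one axis of a tensor at a time, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Axial Transformer itself: there are no learned query/key/value projections, no multiple heads, and no autoregressive masking, and the helper names `attend` and `axial_attention` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def attend(seq: List[Vector]) -> List[Vector]:
    """Plain softmax self-attention over one 1-D sequence (queries = keys = values)."""
    dim = len(seq[0])
    out = []
    for q in seq:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, seq)) / total for d in range(dim)])
    return out


def axial_attention(grid: List[List[Vector]], axis: int) -> List[List[Vector]]:
    """Attend along rows (axis=1) or along columns (axis=0) of an H x W grid of vectors."""
    if axis == 1:
        return [attend(row) for row in grid]
    columns = [attend(list(col)) for col in zip(*grid)]  # attend within each column
    return [list(row) for row in zip(*columns)]          # transpose back to H x W


# A 2 x 3 grid of 2-dimensional feature vectors (toy data).
grid = [[[float(r), float(c)] for c in range(3)] for r in range(2)]
rows_only = axial_attention(grid, axis=1)   # each position sees only its own row
both = axial_attention(rows_only, axis=0)   # after the column pass, every position has seen the whole grid
```

Each position compares itself against W positions in the row pass and H in the column pass, rather than against all H × W positions at once, which is where the savings over full self-attention come from.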

“CATER: A Diagnostic Dataset for Compositional Actions and TEmporal Reasoning”, Girdhar & Ramanan 2019

“CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning”, Rohit Girdhar, Deva Ramanan (2019-10-10):

Computer vision has undergone a dramatic revolution in performance, driven in large part through deep features trained on large-scale supervised datasets. However, most of these improvements have focused on static image analysis; video understanding has seen rather modest improvements. Even though new datasets and spatiotemporal models have been proposed, simple frame-by-frame classification methods often still remain competitive. We posit that current video datasets are plagued with implicit biases over scene and object structure that can dwarf variations in temporal structure.

In this work, we build a video dataset with fully observable and controllable object and scene bias, and which truly requires spatiotemporal understanding in order to be solved. Our dataset, named CATER, is rendered synthetically using a library of standard 3D objects, and tests the ability to recognize compositions of object movements that require long-term reasoning.

In addition to being a challenging dataset, CATER also provides a plethora of diagnostic tools to analyze modern spatiotemporal video architectures by being completely observable and controllable. Using CATER, we provide insights into some of the most recent state of the art deep video architectures.

“CLEVRER: CoLlision Events for Video REpresentation and Reasoning”, Yi et al 2019

“CLEVRER: CoLlision Events for Video REpresentation and Reasoning”, Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum (2019-10-03):

The ability to reason about temporal and causal events from videos lies at the core of human intelligence. Most video reasoning benchmarks, however, focus on pattern recognition from complex visual and language input, instead of on causal structure. We study the complementary problem, exploring the temporal and causal structures behind videos of objects with simple visual appearance.

To this end, we introduce the CoLlision Events for Video REpresentation and Reasoning (CLEVRER), a diagnostic video dataset for systematic evaluation of computational models on a wide range of reasoning tasks. Motivated by the theory of human causal judgment, CLEVRER includes four types of questions: descriptive (eg. “what color”), explanatory (“what is responsible for”), predictive (“what will happen next”), and counterfactual (“what if”).

We evaluate various state-of-the-art models for visual reasoning on our benchmark. While these models thrive on the perception-based task (descriptive), they perform poorly on the causal tasks (explanatory, predictive and counterfactual), suggesting that a principled approach for causal reasoning should incorporate the capability of both perceiving complex visual and language inputs, and understanding the underlying dynamics and causal relations. We also study an oracle model that explicitly combines these components via symbolic representations.

“A Short Note on the Kinetics-700 Human Action Dataset”, Carreira et al 2019

“A Short Note on the Kinetics-700 Human Action Dataset”, Joao Carreira, Eric Noland, Chloe Hillier, Andrew Zisserman (2019-07-15):

We describe an extension of the DeepMind Kinetics human action dataset from 600 classes to 700 classes, where for each class there are at least 600 video clips from different YouTube videos.

This paper details the changes introduced for this new release of the dataset, and includes a comprehensive set of statistics as well as baseline results using the I3D neural network architecture.

“NoGAN: Decrappification, DeOldification, and Super Resolution”, Antic et al 2019

“NoGAN: Decrappification, DeOldification, and Super Resolution”, Jason Antic, Jeremy Howard, Uri Manor (2019-05-03):

Generative models are models that generate music, images, text, and other complex data types. In recent years generative models have advanced at an astonishing rate, largely due to deep learning, and particularly due to generative adversarial networks (GANs). However, GANs are notoriously difficult to train: they require a large amount of data, need many GPUs and a lot of time to train, and are highly sensitive to minor hyperparameter changes. Recent work has aimed at making a range of models easier and faster to train, with a particular focus on using transfer learning. Transfer learning refers to pre-training a model using readily available data and quick and easy to calculate loss functions, and then fine-tuning that model for a task that may have fewer labels, or be more expensive to compute. This seemed like a potential solution to the GAN training problem, so in late 2018 a transfer learning technique for generative modeling was developed.

The pre-trained model selected was this: start with an image dataset and “crappify” the images, such as by reducing the resolution, adding JPEG artifacts, and obscuring parts with random text. Then train a model to “decrappify” those images to return them to their original state. Training started with a model that was pre-trained for ImageNet classification, to which a U-Net upsampling network was added, with various modern tweaks to the regular U-Net. A simple, fast loss function was initially used: mean squared pixel error. This U-Net could be trained in just a few minutes. Then, the loss function was replaced with a combination of other loss functions used in the generative modeling literature (more details in the f8 video) and the model was trained for another couple of hours. The plan was then to finally add a GAN for the last few epochs; however, the results turned out to be so good that a GAN was not used for the final models….

NoGAN Training: NoGAN is a new and exciting technique in GAN training that we developed, in pursuit of higher quality and more stable renders. How, and how well, it works is a bit surprising.

Here is the NoGAN training process:

  1. Pretrain the Generator. The generator is first trained in a more conventional and easier to control manner, with Perceptual Loss (aka Feature Loss) by itself. GAN training is not introduced yet. At this point you’re training the generator as best as you can in the easiest way possible. This takes up most of the time in NoGAN training. Keep in mind: this pretraining by itself will get the generator model far. Colorization will be well-trained as a task, albeit the colors will tend toward dull tones. Self-Attention will also be well-trained at this stage, which is very important.

  2. Save Generated Images From Pretrained Generator.

  3. Pretrain the Critic as a Binary Classifier. Much like in pretraining the generator, what we aim to achieve in this step is to get as much training as possible for the critic in a more “conventional” manner which is easier to control. And there’s nothing easier than a binary classifier! Here we’re training the critic as a binary classifier of real and fake images, with the fake images being those saved in the previous step. A helpful thing to keep in mind here is that you can simply use a pre-trained critic used for another image-to-image task and refine it. This has already been done for super-resolution, where the critic’s pretrained weights were loaded from that of a critic trained for colorization. All that is needed to make use of the pre-trained critic in this case is a little fine-tuning.

  4. Train Generator and Critic in (Almost) Normal GAN Setting. Quickly! This is the surprising part. It turns out that in this pretraining scenario, the critic will rapidly drive adjustments in the generator during GAN training. This happens during a narrow window of time before an “inflection point” of sorts is hit. After this point, there seems to be little to no benefit in training any further in this manner. In fact, if training is continued after this point, you’ll start seeing artifacts and glitches introduced in renderings.

In the case of DeOldify, training to this point requires iterating through only about 1% to 3% of ImageNet data (or roughly 2600 to 7800 iterations on a batch size of five). This amounts to just around 30–90 minutes of GAN training, which is in stark contrast to the three to five days of progressively-sized GAN training that was done previously. Surprisingly, during that short amount of training, the change in the quality of the renderings is dramatic. In fact, this makes up the entirety of GAN training for the video model. The “artistic” and “stable” models go one step further and repeat the NoGAN training process steps 2–4 until there’s no more apparent benefit (around five repeats).

Note: a small but important change to this GAN training that deviates from conventional GANs is the use of a loss threshold that must be met by the critic before generator training commences. Until then, the critic continues training to “catch up” in order to be able to provide the generator with constructive gradients. This catch up chiefly takes place at the beginning of GAN training which immediately follows generator and critic pretraining.
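The iteration counts and the critic loss threshold described above can be illustrated with a toy sketch. It rests on several assumptions: the "ImageNet data" is taken to be the ILSVRC-2012 training set (1,281,167 images), the threshold is read as gating generator updates on each iteration, and the names `gan_iterations` and `short_gan_phase` are hypothetical. It is not the DeOldify implementation.

```python
from typing import Callable, List

IMAGENET_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training set size (assumed to be the data meant)


def gan_iterations(fraction: float, batch_size: int = 5) -> int:
    """Iterations needed to pass once over `fraction` of the dataset."""
    return round(IMAGENET_TRAIN_IMAGES * fraction / batch_size)


def short_gan_phase(
    critic_step: Callable[[], float],
    generator_step: Callable[[], None],
    iterations: int,
    critic_loss_threshold: float,
) -> List[str]:
    """Run the brief GAN phase, holding the generator back until the critic's
    loss has dropped to the threshold (the "catch up" described in the note)."""
    log = []
    for _ in range(iterations):
        loss = critic_step()
        if loss > critic_loss_threshold:
            log.append("critic")  # critic is still catching up; skip the generator
            continue
        generator_step()
        log.append("critic+generator")
    return log


# Toy run: a made-up critic loss curve that crosses the threshold on the 4th step.
losses = iter([0.9, 0.7, 0.6, 0.4, 0.3])
generator_updates = []
log = short_gan_phase(
    critic_step=lambda: next(losses),
    generator_step=lambda: generator_updates.append(1),
    iterations=5,
    critic_loss_threshold=0.5,
)

low, high = gan_iterations(0.01), gan_iterations(0.03)
```

Under these assumptions `low` and `high` come out at 2562 and 7687, consistent with the "roughly 2600 to 7800" quoted above.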

“Billion-scale Semi-supervised Learning for Image Classification”, Yalniz et al 2019

“Billion-scale semi-supervised learning for image classification”, I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, Dhruv Mahajan (2019-05-02):

This paper presents a study of semi-supervised learning with large convolutional networks. We propose a pipeline, based on a teacher/​student paradigm, that leverages a large collection of unlabeled images (up to 1 billion). Our main goal is to improve the performance for a given target architecture, like ResNet-50 or ResNext. We provide an extensive analysis of the success factors of our approach, which leads us to formulate some recommendations to produce high-accuracy models for image classification with semi-supervised learning. As a result, our approach brings important gains to standard architectures for image, video and fine-grained classification. For instance, by leveraging one billion unlabeled images, our learned vanilla ResNet-50 achieves 81.2% top-1 accuracy on the ImageNet benchmark.

“VideoBERT: A Joint Model for Video and Language Representation Learning”, Sun et al 2019

“VideoBERT: A Joint Model for Video and Language Representation Learning”, Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid (2019-04-03):

Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use VideoBERT in numerous tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, we outperform the state-of-the-art on video captioning, and quantitative results verify that the model learns high-level semantic features.

“CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018

“CCNet: Criss-Cross Attention for Semantic Segmentation”, Zilong Huang, Xinggang Wang, Yunchao Wei, Lichao Huang, Humphrey Shi, Wenyu Liu, Thomas S. Huang (2018-11-28):

Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a Criss-Cross Network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Besides, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet has the following merits: (1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11× less GPU memory usage. (2) High computational efficiency. The recurrent criss-cross attention reduces FLOPs by about 85% relative to the non-local block. (3) State-of-the-art performance. We conduct extensive experiments on semantic segmentation benchmarks including Cityscapes, ADE20K, human parsing benchmark LIP, instance segmentation benchmark COCO, and video segmentation benchmark CamVid. In particular, our CCNet achieves mIoU scores of 81.9%, 45.76% and 55.47% on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are the new state-of-the-art results. The source code is available.
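The claim that one further recurrent pass lets each pixel capture full-image dependencies can be checked with a small reachability sketch. It models only which pixels can influence which, not the attention weights or features, and the names `criss_cross_path` and `receptive_field` are hypothetical.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]


def criss_cross_path(p: Pixel, height: int, width: int) -> Set[Pixel]:
    """Pixels sharing a row or a column with p (p itself included)."""
    r, c = p
    return {(r, j) for j in range(width)} | {(i, c) for i in range(height)}


def receptive_field(p: Pixel, height: int, width: int, passes: int) -> Set[Pixel]:
    """Input pixels that can influence p after `passes` criss-cross passes."""
    field = {p}
    for _ in range(passes):
        # Each pass extends the field by the criss-cross path of every pixel already in it.
        field = set().union(*(criss_cross_path(q, height, width) for q in field))
    return field


H, W = 5, 7
one_pass = receptive_field((2, 3), H, W, passes=1)    # H + W - 1 pixels: one row plus one column
two_passes = receptive_field((2, 3), H, W, passes=2)  # every pixel in the H x W image
```

Per pass, each pixel attends to H + W − 1 positions instead of H × W, which is the source of the efficiency relative to a non-local block; the 11× and 85% figures are the paper's measurements and are not derived here.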

“Evolving Space-Time Neural Architectures for Videos”, Piergiovanni et al 2018

“Evolving Space-Time Neural Architectures for Videos”, AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo (2018-11-26):

We present a new method for finding video CNN architectures that capture rich spatio-temporal information in videos. Previous work, taking advantage of 3D convolutions, obtained promising results by manually designing video CNN architectures. We here develop a novel evolutionary search algorithm that automatically explores models with different types and combinations of layers to jointly learn interactions between spatial and temporal aspects of video representations. We demonstrate the generality of this algorithm by applying it to two meta-architectures, obtaining new architectures superior to manually designed architectures. Further, we propose a new component, the iTGM layer, which more efficiently utilizes its parameters to allow learning of space-time interactions over longer time horizons. The iTGM layer is often preferred by the evolutionary algorithm and allows building cost-efficient networks. The proposed approach discovers new and diverse video architectures that were previously unknown. More importantly they are both more accurate and faster than prior models, and outperform the state-of-the-art results on multiple datasets we test, including HMDB, Kinetics, and Moments in Time. We will open source the code and models, to encourage future model development.

“Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow”, Peng et al 2018

“Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow”, Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, Sergey Levine (2018-10-01):

Adversarial learning methods have been proposed for a wide range of applications, but the training of adversarial models can be notoriously unstable. Effectively balancing the performance of the generator and discriminator is critical, since a discriminator that achieves very high accuracy will produce relatively uninformative gradients. In this work, we propose a simple and general technique to constrain information flow in the discriminator by means of an information bottleneck. By enforcing a constraint on the mutual information between the observations and the discriminator’s internal representation, we can effectively modulate the discriminator’s accuracy and maintain useful and informative gradients.

We demonstrate that our proposed variational discriminator bottleneck (VDB) leads to significant improvements across three distinct application areas for adversarial learning algorithms. Our primary evaluation studies the applicability of the VDB to imitation learning of dynamic continuous control skills, such as running. We show that our method can learn such skills directly from raw video demonstrations, substantially outperforming prior adversarial imitation learning methods. The VDB can also be combined with adversarial inverse reinforcement learning to learn parsimonious reward functions that can be transferred and re-optimized in new settings. Finally, we demonstrate that VDB can train GANs more effectively for image generation, improving upon a number of prior stabilization methods.

“Large-Scale Visual Speech Recognition”, Shillingford et al 2018

“Large-Scale Visual Speech Recognition”, Brendan Shillingford, Yannis Assael, Matthew W. Hoffman, Thomas Paine, Cían Hughes, Utsav Prabhu, Hank Liao et al (2018-07-13):

This work presents a scalable solution to open-vocabulary visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of text and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% as measured on a held-out set. In comparison, professional lipreaders achieve either 86.4% or 92.9% WER on the same dataset when having access to additional types of contextual information. Our approach significantly improves on other lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which are only capable of 89.8% and 76.8% WER respectively.
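Word error rate, the metric quoted throughout this abstract, is the word-level edit distance between the reference transcript and the system output, divided by the number of reference words; lower is better. A minimal sketch follows, with a hypothetical function name and a made-up sentence pair; the paper's own text normalization and scoring details are not reproduced.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion (second "the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.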

“Playing Hard Exploration Games by Watching YouTube”, Aytar et al 2018

“Playing hard exploration games by watching YouTube”, Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, Nando de Freitas (2018-05-29):

Deep reinforcement learning methods traditionally struggle with tasks where environment rewards are particularly sparse. One successful method of guiding exploration in these domains is to imitate trajectories provided by a human demonstrator. However, these demonstrations are typically collected under artificial conditions, i.e. with access to the agent’s exact environment setup and the demonstrator’s action and reward trajectories.

Here we propose a two-stage method that overcomes these limitations by relying on noisy, unaligned footage without access to such data. First, we learn to map unaligned videos from multiple sources to a common representation using self-supervised objectives constructed over both time and modality (ie. vision and sound). Second, we embed a single YouTube video in this representation to construct a reward function that encourages an agent to imitate human gameplay. This method of one-shot imitation allows our agent to convincingly exceed human-level performance on the infamously hard exploration games Montezuma’s Revenge, Pitfall! and Private Eye for the first time, even if the agent is not presented with any environment rewards.

“The Sound of Pixels”, Zhao et al 2018

“The Sound of Pixels”, Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba (2018-04-09):

We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel. Our approach capitalizes on the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiring additional manual supervision. Experimental results on a newly collected MUSIC dataset show that our proposed Mix-and-Separate framework outperforms several baselines on source separation. Qualitative results suggest our model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources.

“One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning”, Yu et al 2018

“One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning”, Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, Sergey Levine (2018-02-05):

Humans and animals are capable of learning a new behavior by observing others perform the skill just once. We consider the problem of allowing a robot to do the same: learning from raw video pixels of a human, even when there is substantial domain shift in the perspective, environment, and embodiment between the robot and the observed human. Prior approaches to this problem have hand-specified how human and robot actions correspond and often relied on explicit human pose detection systems. In this work, we present an approach for one-shot learning from a video of a human by using human and robot demonstration data from a variety of previous tasks to build up prior knowledge through meta-learning. Then, combining this prior knowledge and only a single video demonstration from a human, the robot can perform the task that the human demonstrated.

We show experiments on both a PR2 arm and a Sawyer arm, demonstrating that after meta-learning, the robot can learn to place, push, and pick-and-place new objects using just one video of a human performing the manipulation.

“Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition”, Ye et al 2017

“Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition”, Jinmian Ye, Linnan Wang, Guangxi Li, Di Chen, Shandian Zhe, Xinqi Chu, Zenglin Xu (2017-12-14):

Recurrent Neural Networks (RNNs) are powerful sequence modeling tools. However, when dealing with high dimensional inputs, the training of RNNs becomes computationally expensive due to the large number of model parameters. This hinders RNNs from solving many important computer vision tasks, such as Action Recognition in Videos and Image Captioning. To overcome this problem, we propose a compact and flexible structure, namely Block-Term tensor decomposition, which greatly reduces the parameters of RNNs and improves their training efficiency. Compared with alternative low-rank approximations, such as tensor-train RNN (TT-RNN), our method, Block-Term RNN (BT-RNN), is not only more concise (when using the same rank), but also able to attain a better approximation to the original RNNs with far fewer parameters. On three challenging tasks, including Action Recognition in Videos, Image Captioning and Image Generation, BT-RNN outperforms TT-RNN and the standard RNN in terms of both prediction accuracy and convergence rate. Specifically, BT-LSTM utilizes 17,388× fewer parameters than the standard LSTM to achieve an accuracy improvement over 15.6% in the Action Recognition task on the UCF11 dataset.

“Reinforced Video Captioning With Entailment Rewards”, Pasunuru & Bansal 2017

“Reinforced Video Captioning with Entailment Rewards”, Ramakanth Pasunuru, Mohit Bansal (2017-08-07; backlinks; similar):

Sequence-to-sequence models have shown promising improvements on the temporal task of video captioning, but they optimize word-level cross-entropy loss during training. First, using policy gradient and mixed-loss methods for reinforcement learning, we directly optimize sentence-level task-based metrics (as rewards), achieving significant improvements over the baseline, based on both automatic metrics and human evaluation on multiple datasets. Next, we propose a novel entailment-enhanced reward (CIDEnt) that corrects phrase-matching based metrics (such as CIDEr) to only allow for logically-implied partial matches and avoid contradictions, achieving further significant improvements over the CIDEr-reward model. Overall, our CIDEnt-reward model achieves the new state-of-the-art on the MSR-VTT dataset.
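The entailment-corrected reward and mixed loss described above can be sketched as follows. This is a minimal illustration, assuming a simple threshold rule: when an entailment classifier judges the generated caption to be weakly entailed by the reference, the phrase-matching score is penalized. The function names, the threshold `beta`, the penalty `lam`, and the mixing weight `gamma` are hypothetical and are not values stated in the abstract.

```python
def cident_reward(cider: float, entail_prob: float,
                  beta: float = 0.5, lam: float = 0.5) -> float:
    """Return the CIDEr score, minus a penalty when entailment is weak.

    beta (entailment threshold) and lam (penalty) are illustrative
    assumptions, not the paper's reported settings.
    """
    return cider - lam if entail_prob < beta else cider


def mixed_loss(rl_loss: float, xe_loss: float, gamma: float = 0.9) -> float:
    """Blend a sentence-level reinforcement loss with word-level cross-entropy.

    gamma is a hypothetical mixing weight.
    """
    return gamma * rl_loss + (1.0 - gamma) * xe_loss
```

Under this sketch, a caption that matches many reference phrases but contradicts the reference keeps a high CIDEr score yet receives a lower reward, which is the behavior the entailment correction is meant to produce.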

“Tracking As Online Decision-Making: Learning a Policy from Streaming Videos With Reinforcement Learning”, Supancic & Ramanan 2017

“Tracking as Online Decision-Making: Learning a Policy from Streaming Videos with Reinforcement Learning”, James Steven Supancic III, Deva Ramanan (2017-07-17; backlinks; similar):

We formulate tracking as an online decision-making process, where a tracking agent must follow an object despite ambiguous image frames and a limited computational budget. Crucially, the agent must decide where to look in the upcoming frames, when to reinitialize because it believes the target has been lost, and when to update its appearance model for the tracked object. Such decisions are typically made heuristically. Instead, we propose to learn an optimal decision-making policy by formulating tracking as a partially observable decision-making process (POMDP). We learn policies with deep reinforcement learning algorithms that need supervision (a reward signal) only when the track has gone awry. We demonstrate that sparse rewards allow us to quickly train on massive datasets, several orders of magnitude more than past work. Interestingly, by treating the data source of Internet videos as unlimited streams, we both learn and evaluate our trackers in a single, unified computational stream.

“Learning to Learn from Noisy Web Videos”, Yeung et al 2017

“Learning to Learn from Noisy Web Videos”, Serena Yeung, Vignesh Ramanathan, Olga Russakovsky, Liyue Shen, Greg Mori, Li Fei-Fei (2017-06-09; backlinks; similar):

Understanding the simultaneously very diverse and intricately fine-grained set of possible human actions is a critical open problem in computer vision. Manually labeling training videos is feasible for some action classes but doesn’t scale to the full long-tailed distribution of actions. A promising way to address this is to leverage noisy data from web queries to learn new actions, using semi-supervised or “webly-supervised” approaches. However, these methods typically do not learn domain-specific knowledge, or rely on iterative hand-tuned data labeling policies. In this work, we instead propose a reinforcement learning-based formulation for selecting the right examples for training a classifier from noisy web search results. Our method uses Q-learning to learn a data labeling policy on a small labeled training dataset, and then uses this to automatically label noisy web data for new visual concepts. Experiments on the challenging Sports-1M action recognition benchmark as well as on additional fine-grained and newly emerging action classes demonstrate that our method is able to learn good labeling policies for noisy data and use this to learn accurate visual concept classifiers.

“Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Carreira & Zisserman 2017

“Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Joao Carreira, Andrew Zisserman (2017-05-22; backlinks):

The paucity of videos in current action classification datasets (UCF101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks.

This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset⁠. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics.

We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.9% on HMDB-51 and 98.0% on UCF101.
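One way to reuse 2D parameters when inflating a filter into 3D can be sketched in plain Python. The abstract says only that filters and their parameters are expanded into 3D; the repeat-and-rescale scheme below (copy the 2D kernel along a new time axis and divide by the number of time steps) is an assumption made for illustration, and all function names are hypothetical.

```python
def inflate_kernel(kernel_2d, time_steps):
    """Repeat a 2D kernel along a new time axis, rescaled by 1/time_steps."""
    return [[[w / time_steps for w in row] for row in kernel_2d]
            for _ in range(time_steps)]


def conv2d_response(patch, kernel_2d):
    """Response of a 2D kernel at one spatial location (elementwise dot product)."""
    return sum(p * w
               for patch_row, kernel_row in zip(patch, kernel_2d)
               for p, w in zip(patch_row, kernel_row))


def conv3d_response(clip, kernel_3d):
    """Response of a 3D kernel on a clip with one frame per kernel time slice."""
    return sum(conv2d_response(frame, kernel_slice)
               for frame, kernel_slice in zip(clip, kernel_3d))


kernel = [[1.0, 0.0], [0.0, -1.0]]
patch = [[3.0, 2.0], [1.0, 5.0]]
static_clip = [patch] * 4  # the same frame repeated over time

response_2d = conv2d_response(patch, kernel)
response_3d = conv3d_response(static_clip, inflate_kernel(kernel, 4))
```

With this rescaling, the inflated filter gives the same response on a video made of one repeated frame as the 2D filter gives on that frame, so pretrained image weights remain a sensible starting point.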

“The Kinetics Human Action Video Dataset”, Kay et al 2017

“The Kinetics Human Action Video Dataset”⁠, Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola et al (2017-05-19; backlinks):

We describe the DeepMind Kinetics human action video dataset.

The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands.

We describe the statistics of the dataset, how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action classification on this dataset. We also carry out a preliminary analysis of whether imbalance in the dataset leads to bias in the classifiers.

“Time-Contrastive Networks: Self-Supervised Learning from Video”, Sermanet et al 2017

“Time-Contrastive Networks: Self-Supervised Learning from Video”, Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine (2017-04-23; backlinks; similar):

We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that captures the relationships between end-effectors (hands or robot grippers) and the environment, object attributes, and body pose. We train our representations using a metric learning loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. In other words, the model simultaneously learns to recognize what is common between different-looking images, and what is different between similar-looking images. This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background. We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm. While representations are learned from an unlabeled collection of task-related videos, robot behaviors such as pouring are learned by watching a single 3rd-person demonstration by a human. Reward functions obtained by following the human demonstrations under the learned representation enable efficient reinforcement learning that is practical for real-world robotic systems. Video results, open-source code and dataset are available at https: /  ​ /  ​ /  ​imitate
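The attract/repel signal described above can be illustrated with a triplet-style hinge loss. The abstract says only that a metric learning loss is used; the specific hinge form, the squared Euclidean distance, and the `margin` value below are assumptions for this sketch, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def time_contrastive_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss over one (anchor, positive, negative) triplet.

    anchor:   embedding of a frame from viewpoint 1 at time t
    positive: embedding of the simultaneous frame from viewpoint 2
    negative: embedding of a temporally nearby frame from viewpoint 1
    margin is an illustrative assumption.
    """
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)
```

The loss is zero once the simultaneous view is closer to the anchor than the temporal neighbor by at least the margin, which pushes the embedding to ignore viewpoint while staying sensitive to changes over time.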

“LipNet: End-to-End Sentence-level Lipreading”, Assael et al 2016

“LipNet: End-to-End Sentence-level Lipreading”⁠, Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, Nando de Freitas (2016-11-05; backlinks; similar):

Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end performs only word classification, rather than sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 95.2% accuracy in sentence-level, overlapped speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art accuracy (Gergen et al 2016).
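To show how connectionist temporal classification lets a variable-length sequence of frames map to a shorter text, here is a minimal sketch of the standard CTC collapse rule applied to per-frame best labels (greedy decoding). The abstract mentions only the CTC loss; greedy decoding, the `_` blank symbol, and the function name are assumptions for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse per-frame labels into text using the CTC rule.

    1. Merge runs of identical consecutive labels.
    2. Drop the blank symbol.
    A blank between two identical labels keeps them as separate characters.
    """
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)
```

For example, `ctc_greedy_decode(list("bb_ii__nn"))` yields `"bin"`: nine frames map to three characters without any frame-level alignment being supplied during training.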

“Deep Visual Foresight for Planning Robot Motion”, Finn & Levine 2016

“Deep Visual Foresight for Planning Robot Motion”, Chelsea Finn, Sergey Levine (2016-10-03; similar):

A key challenge in scaling up robot learning to many skills and environments is removing the need for human supervision, so that robots can collect their own data and improve their own performance without being limited by the cost of requesting human feedback. Model-based reinforcement learning holds the promise of enabling an agent to learn to predict the effects of its actions, which could provide flexible predictive models for a wide range of tasks and environments, without detailed human supervision.

We develop a method for combining deep action-conditioned video prediction models with model-predictive control that uses entirely unlabeled training data. Our approach does not require a calibrated camera, an instrumented training set-up, nor precise sensing and actuation. Our results show that our method enables a real robot to perform nonprehensile manipulation—pushing objects—and can handle novel objects not seen during training.

“Artistic Style Transfer for Videos”, Ruder et al 2016

“Artistic style transfer for videos”⁠, Manuel Ruder, Alexey Dosovitskiy, Thomas Brox (2016-04-28; similar):

In the past, manually re-drawing an image in a certain artistic style required a professional artist and a long time. Doing this for a video sequence single-handed was beyond imagination. Nowadays computers provide new possibilities. We present an approach that transfers the style from one image (for example, a painting) to a whole video sequence. We make use of recent advances in style transfer in still images and propose new initializations and loss functions applicable to videos. This allows us to generate consistent and stable stylized video sequences, even in cases with large motion and strong occlusion. We show that the proposed method clearly outperforms simpler baselines both qualitatively and quantitatively.

“YFCC100M: The New Data in Multimedia Research”, Thomee et al 2015

“YFCC100M: The New Data in Multimedia Research”, Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth et al (2015-03-05; backlinks; similar):

We present the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M), the largest public multimedia collection that has ever been released. The dataset contains a total of 100 million media objects, of which ~99.2 million are photos and 0.8 million are videos, all of which carry a Creative Commons license. Each media object in the dataset is represented by several pieces of metadata, eg. Flickr identifier, owner name, camera, title, tags, geo, media source. The collection provides a comprehensive snapshot of how photos and videos were taken, described, and shared over the years, from the inception of Flickr in 2004 until early 2014. In this article we explain the rationale behind its creation, as well as the implications the dataset has for science, research, engineering, and development. We further present several new challenges in multimedia research that can now be expanded upon with our dataset.

“UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild”, Soomro et al 2012

“UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild”⁠, Khurram Soomro, Amir Roshan Zamir, Mubarak Shah (2012-12-03; backlinks):

We introduce UCF101, which is currently the largest dataset of human actions.

It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user uploaded videos containing camera motion and cluttered background.

Additionally, we provide baseline action recognition results on this new dataset using a standard bag-of-words approach, with an overall performance of 44.5%.

To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, its large number of clips, and the unconstrained nature of those clips.