video analysis tag

Gwern Branwen

Links

“InternVid: A Large-Scale Video-Text Dataset for Multimodal Understanding and Generation”, Wang et al 2023

“Test-Time Training on Video Streams”, Wang et al 2023

“Magenta Green Screen: Spectrally Multiplexed Alpha Matting With Deep Colorization”, Smirnov et al 2023

“PaLI-X: On Scaling up a Multilingual Vision and Language Model”, Chen et al 2023

“ImageBind: One Embedding Space To Bind Them All”, Girdhar et al 2023

“Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023

“Video-Text Modeling With Zero-Shot Transfer from Contrastive Captioners”, Yan et al 2022

“VindLU: A Recipe for Effective Video-And-Language Pretraining”, Cheng et al 2022

“Videogenic: Video Highlights via Photogenic Moments”, Lin et al 2022

“AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies”, Siyao et al 2022

“Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends”, Gan et al 2022

“TVLT: Textless Vision-Language Transformer”, Tang et al 2022

“EVL: Frozen CLIP Models Are Efficient Video Learners”, Lin et al 2022

“X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, Ni et al 2022

“X-CLIP: End-To-End Multi-Grained Contrastive Learning for Video-Text Retrieval”, Ma et al 2022

“Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”, Baker et al 2022

“OmniMAE: Single Model Masked Pretraining on Images and Videos”, Girdhar et al 2022

“LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Li et al 2022

“MLP-3D: A MLP-Like 3D Architecture With Grouped Time Mixing”, Qiu et al 2022

“Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022

“Revisiting the "Video" in Video-Language Understanding”, Buch et al 2022

“VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Wang et al 2022

“Masked Autoencoders As Spatiotemporal Learners”, Feichtenhofer et al 2022

“Imitating, Fast and Slow: Robust Learning from Demonstrations via Decision-Time Planning”, Qi et al 2022

“ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022

“Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Zeng et al 2022

“Reinforcement Learning With Action-Free Pre-Training from Videos”, Seo et al 2022

“CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-Shot Transfer Learning”, Taesiri et al 2022

“Robot Peels Banana With Goal-Conditioned Dual-Action Deep Imitation Learning”, Kim et al 2022

“Hierarchical Perceiver”, Carreira et al 2022

“MuZero With Self-Competition for Rate Control in VP9 Video Compression”, Mandhane et al 2022

“BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation”, Li et al 2022

“MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition”, Wu et al 2022

“CAST: Character Labeling in Animation Using Self-Supervision by Tracking”, Nir et al 2022

“AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction”, Shi et al 2022

“Noether Networks: Meta-Learning Useful Conserved Quantities”, Alet et al 2021

“MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Soldan et al 2021

“MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”, Zhang et al 2021

“Florence: A New Foundation Model for Computer Vision”, Yuan et al 2021

“Scaling ASR Improves Zero and Few Shot Learning”, Xiao et al 2021

“ADOP: Approximate Differentiable One-Pixel Point Rendering”, Rückert et al 2021

“VideoCLIP: Contrastive Pre-Training for Zero-Shot Video-Text Understanding”, Xu et al 2021

“Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021

“CLIP-It! Language-Guided Video Summarization”, Narasimhan et al 2021

“CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Fang et al 2021

“Revisiting ResNets: Improved Training and Scaling Strategies”, Bello et al 2021

“Learning from Videos to Understand the World”, Zweig et al 2021

“Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021

“Video Transformer Network”, Neimark et al 2021

“Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning”, Lee et al 2021

“MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2016

“CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021

“Transformers in Vision: A Survey”, Khan et al 2021

“Object-Based Attention for Spatio-Temporal Reasoning: Outperforming Neuro-Symbolic Models With Flexible Distributed Architectures”, Ding et al 2020

“Accuracy and Performance Comparison of Video Action Recognition Approaches”, Hutchinson et al 2020

“Self-Supervised Learning through the Eyes of a Child”, Orhan et al 2020

“Gesticulator: A Framework for Semantically-Aware Speech-Driven Gesture Generation”, Kucherenko et al 2020

“SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded from the Infant’s Perspective”, Sullivan et al 2020

“Axial Attention in Multidimensional Transformers”, Ho et al 2019

“CATER: A Diagnostic Dataset for Compositional Actions and TEmporal Reasoning”, Girdhar & Ramanan 2019

“CLEVRER: CoLlision Events for Video REpresentation and Reasoning”, Yi et al 2019

“Training Kinetics in 15 Minutes: Large-Scale Distributed Training on Videos”, Lin et al 2019

“A Short Note on the Kinetics-700 Human Action Dataset”, Carreira et al 2019

“Billion-Scale Semi-Supervised Learning for Image Classification”, Yalniz et al 2019

“VideoBERT: A Joint Model for Video and Language Representation Learning”, Sun et al 2019

“Real-Time Continuous Transcription With Live Transcribe”, Savla 2019

“CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018

“Evolving Space-Time Neural Architectures for Videos”, Piergiovanni et al 2018

“Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow”, Peng et al 2018

“A Short Note about Kinetics-600”, Carreira et al 2018

“Large-Scale Visual Speech Recognition”, Shillingford et al 2018

“Playing Hard Exploration Games by Watching YouTube”, Aytar et al 2018

“BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning”, Yu et al 2018

“The Sound of Pixels”, Zhao et al 2018

“One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning”, Yu et al 2018

“Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition”, Ye et al 2017

“Reinforced Video Captioning With Entailment Rewards”, Pasunuru & Bansal 2017

“Tracking As Online Decision-Making: Learning a Policy from Streaming Videos With Reinforcement Learning”, Supancic & Ramanan 2017

“Learning to Learn from Noisy Web Videos”, Yeung et al 2017

“Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Carreira & Zisserman 2017

“The Kinetics Human Action Video Dataset”, Kay et al 2017

“Dense-Captioning Events in Videos”, Krishna et al 2017

“Time-Contrastive Networks: Self-Supervised Learning from Video”, Sermanet et al 2017

“LipNet: End-To-End Sentence-Level Lipreading”, Assael et al 2016

“Deep Visual Foresight for Planning Robot Motion”, Finn & Levine 2016

“Temporal Convolutional Networks: A Unified Approach to Action Segmentation”, Lea et al 2016

“Clockwork Convnets for Video Semantic Segmentation”, Shelhamer et al 2016

“Artistic Style Transfer for Videos”, Ruder et al 2016

“YFCC100M: The New Data in Multimedia Research”, Thomee et al 2015

“UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild”, Soomro et al 2012

Sort By Magic

Annotations are sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of date order, one can browse in topic order. The sorted list has been automatically clustered into multiple sections and auto-labeled for easier browsing.

Beginning with the newest annotation, the sort uses each annotation's embedding to build a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
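
The nearest-neighbor progression described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the site's actual implementation: the function names, the choice of cosine similarity, the greedy chaining strategy, and the toy 3-dimensional embeddings are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sort_by_similarity(items, start=0):
    """Greedy nearest-neighbor ordering (an assumed strategy).

    `items` is a list of (title, embedding) pairs. Starting from the item at
    index `start` (here: the newest annotation), repeatedly hop to the most
    similar item that has not been visited yet.
    """
    if not items:
        return []
    remaining = list(range(len(items)))
    current = remaining.pop(start)
    order = [current]
    while remaining:
        nearest = max(
            remaining,
            key=lambda i: cosine_similarity(items[current][1], items[i][1]),
        )
        remaining.remove(nearest)
        order.append(nearest)
        current = nearest
    return [items[i][0] for i in order]


# Made-up embeddings purely for illustration; real ones would come from a text-embedding model.
annotations = [
    ("InternVid", [0.9, 0.1, 0.0]),   # newest annotation, used as the starting point
    ("LipNet", [0.0, 0.2, 0.9]),
    ("VideoCLIP", [0.8, 0.3, 0.1]),
    ("AV-HuBERT", [0.1, 0.3, 0.8]),
]

print(sort_by_similarity(annotations))
# ['InternVid', 'VideoCLIP', 'AV-HuBERT', 'LipNet']
```

With these toy vectors, the two video-text entries end up adjacent, followed by the two audio-visual speech entries, which is the kind of topic progression the description refers to. Clustering into labeled sections would be a separate step not shown here.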

Miscellaneous

Link Bibliography

https://arxiv.org/abs/2307.05014: “Test-Time Training on Video Streams”, Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang

https://arxiv.org/abs/2305.05665#facebook: “ImageBind: One Embedding Space To Bind Them All”, Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin

https://arxiv.org/abs/2302.05442#google: “Scaling Vision Transformers to 22 Billion Parameters”, Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner

https://arxiv.org/abs/2212.04979#google: “Video-Text Modeling With Zero-Shot Transfer from Contrastive Captioners”, Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu

https://arxiv.org/abs/2212.05051: “VindLU: A Recipe for Effective Video-And-Language Pretraining”, Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius

https://arxiv.org/abs/2209.14156: “TVLT: Textless Vision-Language Transformer”, Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal

https://arxiv.org/abs/2208.03550: “EVL: Frozen CLIP Models Are Efficient Video Learners”, Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li

https://arxiv.org/abs/2207.07285#alibaba: “X-CLIP: End-To-End Multi-Grained Contrastive Learning for Video-Text Retrieval”, Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, Rongrong Ji

https://arxiv.org/abs/2206.11795#openai: “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”, Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro

https://arxiv.org/abs/2206.08356#facebook: “OmniMAE: Single Model Masked Pretraining on Images and Videos”, Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

https://arxiv.org/abs/2206.07160#microsoft: “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, Lijuan Wang

https://arxiv.org/abs/2205.10747: “VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang

https://arxiv.org/abs/2205.09113#facebook: “Masked Autoencoders As Spatiotemporal Learners”, Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He

https://arxiv.org/abs/2204.00598#google: “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo

https://arxiv.org/abs/2203.11096: “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-Shot Transfer Learning”, Mohammad Reza Taesiri, Finlay Macklon, Cor-Paul Bezemer

https://arxiv.org/abs/2201.12086#salesforce: “BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation”, Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi

https://arxiv.org/abs/2111.11432#microsoft: “Florence: A New Foundation Model for Computer Vision”, Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang

https://arxiv.org/abs/2107.14795#deepmind: “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula

https://arxiv.org/abs/2106.11097: “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen

https://arxiv.org/abs/2103.07579#google: “Revisiting ResNets: Improved Training and Scaling Strategies”, Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens

https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/: “Learning from Videos to Understand the World”, Geoffrey Zweig, Polina Kuznetsova, Michael Auli, Francois Fagan

https://arxiv.org/abs/2103.03206#deepmind: “Perceiver: General Perception With Iterative Attention”, Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira

https://arxiv.org/abs/2102.00719: “Video Transformer Network”, Daniel Neimark, Omri Bar, Maya Zohar, Dotan Asselmann

https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf: “CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry

https://arxiv.org/abs/2012.08508#deepmind: “Object-Based Attention for Spatio-Temporal Reasoning: Outperforming Neuro-Symbolic Models With Flexible Distributed Architectures”, David Ding, Felix Hill, Adam Santoro, Matt Botvinick

https://arxiv.org/abs/2008.09037: “Accuracy and Performance Comparison of Video Action Recognition Approaches”, Matthew Hutchinson, Siddharth Samsi, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Micheal Houle

https://arxiv.org/abs/1905.00546#facebook: “Billion-Scale Semi-Supervised Learning for Image Classification”, I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, Dhruv Mahajan

https://arxiv.org/abs/1811.11721: “CCNet: Criss-Cross Attention for Semantic Segmentation”, Zilong Huang, Xinggang Wang, Yunchao Wei, Lichao Huang, Humphrey Shi, Wenyu Liu, Thomas S. Huang

https://arxiv.org/abs/1808.01340#deepmind: “A Short Note about Kinetics-600”, Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, Andrew Zisserman

https://arxiv.org/abs/1705.07750#deepmind: “Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Joao Carreira, Andrew Zisserman

https://arxiv.org/abs/1608.03609: “Clockwork Convnets for Video Semantic Segmentation”, Evan Shelhamer, Kate Rakelly, Judy Hoffman, Trevor Darrell
