- See Also
- Gwern
- “Utext: Rich Unicode Documents”, Gwern 2023
Links
- “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training”, McKinzie et al 2024
- “Discovering Universal Semantic Triggers for Text-to-Image Synthesis”, Zhai et al 2024
- “TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones”, Yuan et al 2023
- “Parrot Captions Teach CLIP to Spot Text”, Lin et al 2023
- “StarVector: Generating Scalable Vector Graphics Code from Images”, Rodriguez et al 2023
- “Vision-Language Models As a Source of Rewards”, Baumli et al 2023
- “Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding”, Evans et al 2023
- “ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations”, Patel et al 2023
- “Alpha-CLIP: A CLIP Model Focusing on Wherever You Want”, Sun et al 2023
- “Are Vision Transformers More Data Hungry Than Newborn Visual Systems?”, Pandey et al 2023
- “BioCLIP: A Vision Foundation Model for the Tree of Life”, Stevens et al 2023
- “Rethinking FID: Towards a Better Evaluation Metric for Image Generation”, Jayasumana et al 2023
- “SatCLIP: Global, General-Purpose Location Embeddings With Satellite Imagery”, Klemmer et al 2023
- “Test-time Adaptation of Discriminative Models via Diffusion Generative Feedback”, Prabhudesai et al 2023
- “One-for-All: Towards Universal Domain Translation With a Single StyleGAN”, Du et al 2023
- “Does CLIP’s Generalization Performance Mainly Stem from High Train-Test Similarity?”, Mayilvahanan et al 2023
- “From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions”, Lai et al 2023
- “LLaVA-1.5: Improved Baselines With Visual Instruction Tuning”, Liu et al 2023
- “Data Filtering Networks”, Fang et al 2023
- “Vision Transformers Need Registers”, Darcet et al 2023
- “Demystifying CLIP Data”, Xu et al 2023
- “Multimodal Neurons in Pretrained Text-Only Transformers”, Schwettmann et al 2023
- “Investigating the Existence of ‘Secret Language’ in Language Models”, Wang et al 2023
- “InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation”, Wang et al 2023
- “PIGEON: Predicting Image Geolocations”, Haas et al 2023
- “CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution”, Freiberger et al 2023
- “SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis”, Podell et al 2023
- “SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality”, Hsieh et al 2023
- “Anime Character Identification and Tag Prediction by Multimodality Modeling: Dataset and Model”, Yi et al 2023
- “ChessGPT: Bridging Policy Learning and Language Modeling”, Feng et al 2023
- “Rosetta Neurons: Mining the Common Units in a Model Zoo”, Dravid et al 2023
- “Image Captioners Are Scalable Vision Learners Too”, Tschannen et al 2023
- “Improving Neural Network Representations Using Human Similarity Judgments”, Muttenthaler et al 2023
- “Artificial Intelligence and Art: Identifying the Esthetic Judgment Factors That Distinguish Human & Machine-generated Artwork”, Samo & Highhouse 2023
- “Generalizable Synthetic Image Detection via Language-guided Contrastive Learning”, Wu et al 2023
- “TorToise: Better Speech Synthesis through Scaling”, Betker 2023
- “ImageBind: One Embedding Space To Bind Them All”, Girdhar et al 2023
- “Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation”, Kirstain et al 2023
- “A Cookbook of Self-Supervised Learning”, Balestriero et al 2023
- “DINOv2: Learning Robust Visual Features without Supervision”, Oquab et al 2023
- “KD-DLGAN: Data Limited Image Generation via Knowledge Distillation”, Cui et al 2023
- “MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks”, Kuo et al 2023
- “Sigmoid Loss for Language Image Pre-Training”, Zhai et al 2023
- “HiCLIP: Contrastive Language-Image Pretraining With Hierarchy-aware Attention”, Geng et al 2023
- “When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It?”, Yuksekgonul et al 2023
- “Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery”, Wen et al 2023
- “BLIP-2: Bootstrapping Language-Image Pre-training With Frozen Image Encoders and Large Language Models”, Li et al 2023
- “MUG: Vision Learners Meet Web Image-Text Pairs”, Zhao et al 2023
- “Reaching 80% Zero-Shot Accuracy With OpenCLIP: ViT-G/14 Trained On LAION-2B”, Wortsman 2023
- “Reproducible Scaling Laws for Contrastive Language-image Learning”, Cherti et al 2022
- “CLIP Itself Is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy With ViT-B and ViT-L on ImageNet”, Dong et al 2022
- “A Whack-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others”, Li et al 2022
- “Scaling Language-Image Pre-training via Masking”, Li et al 2022
- “Videogenic: Video Highlights via Photogenic Moments”, Lin et al 2022
- “Retrieval-Augmented Multimodal Language Modeling”, Yasunaga et al 2022
- “ClipCrop: Conditioned Cropping Driven by Vision-Language Model”, Zhong et al 2022
- “I Can’t Believe There’s No Images! Learning Visual Tasks Using Only Language Data”, Gu et al 2022
- “MaskDistill: A Unified View of Masked Image Modeling”, Anonymous 2022
- “Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces”, Rampas et al 2022
- “AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities”, Chen et al 2022
- “EDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”, Balaji et al 2022
- “Text-Only Training for Image Captioning Using Noise-Injected CLIP”, Nukrai et al 2022
- “3DALL·E: Integrating Text-to-Image AI in 3D Design Workflows”, Liu et al 2022
- “Vision-Language Pre-training: Basics, Recent Advances, and Future Trends”, Gan et al 2022
- “ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training”, Norelli et al 2022
- “Incorporating Natural Language into Vision Models Improves Prediction and Understanding of Higher Visual Cortex”, Wang et al 2022
- “Do Androids Laugh at Electric Sheep? Humor ‘Understanding’ Benchmarks from The New Yorker Caption Contest”, Hessel et al 2022
- “Fast Text2StyleGAN: Text-Free Learning of a Natural Language Interface for Pretrained Face Generators”, Du et al 2022
- “What Does a Platypus Look Like? Generating Customized Prompts for Zero-shot Image Classification (CuPL)”, Pratt et al 2022
- “Decoding Speech from Non-invasive Brain Recordings”, Défossez et al 2022
- “Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP”, Nguyen et al 2022
- “CLIP-based Neural Neighbor Style Transfer for 3D Assets”, Mishra & Granskog 2022
- “EVL: Frozen CLIP Models Are Efficient Video Learners”, Lin et al 2022
- “X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, Ni et al 2022
- “LaTTe: Language Trajectory TransformEr”, Bucker et al 2022
- “Adversarial Attacks on Image Generation With Made-Up Words”, Millière 2022
- “TOnICS: Curriculum Learning for Data-Efficient Vision-Language Alignment”, Srinivasan et al 2022
- “MS-CLIP: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training”, You et al 2022
- “Text-Guided Synthesis of Artistic Images With Retrieval-Augmented Diffusion Models”, Rombach et al 2022
- “NewsStories: Illustrating Articles With Visual Summaries”, Tan et al 2022
- “Semantic Abstraction (SemAbs): Open-World 3D Scene Understanding from 2D Vision-Language Models”, Ha & Song 2022
- “Don’t Stop Learning: Towards Continual Learning for the CLIP Model”, Ding et al 2022
- “X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Ma et al 2022
- “Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning”, Santurkar et al 2022
- “LM-Nav: Robotic Navigation With Large Pre-Trained Models of Language, Vision, and Action”, Shah et al 2022
- “CLAP: Learning Audio Concepts From Natural Language Supervision”, Elizalde et al 2022
- “ADAPT: Vision-Language Navigation With Modality-Aligned Action Prompts”, Lin et al 2022
- “Improved Vector Quantized Diffusion Models”, Tang et al 2022
- “CyCLIP: Cyclic Contrastive Language-Image Pretraining”, Goel et al 2022
- “Fine-grained Image Captioning With CLIP Reward”, Cho et al 2022
- “VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Wang et al 2022
- “AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars”, Hong et al 2022
- “CoCa: Contrastive Captioners Are Image-Text Foundation Models”, Yu et al 2022
- “Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)”, Fang et al 2022
- “Retrieval-Augmented Diffusion Models: Semi-Parametric Neural Image Synthesis”, Blattmann et al 2022
- “Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?”, Cui et al 2022
- “Opal: Multimodal Image Generation for News Illustration”, Liu et al 2022
- “VQGAN-CLIP: Open Domain Image Generation and Editing With Natural Language Guidance”, Crowson et al 2022
- “DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Ramesh et al 2022 (page 16 org openai)
- “No Token Left Behind: Explainability-Aided Image Classification and Generation”, Paiss et al 2022
- “Semantic Exploration from Language Abstractions and Pretrained Representations”, Tam et al 2022
- “Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality”, Thrush et al 2022
- “Unified Contrastive Learning in Image-Text-Label Space”, Yang et al 2022
- “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Zeng et al 2022
- “Learning to Generate Line Drawings That Convey Geometry and Semantics”, Chan et al 2022
- “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Taesiri et al 2022
- “CLIP on Wheels (CoW): Zero-Shot Object Navigation As Object Localization and Exploration”, Gadre et al 2022
- “Bamboo: Building Mega-Scale Vision Dataset Continually With Human-Machine Synergy”, Zhang et al 2022
- “CLIP Models Are Few-shot Learners: Empirical Studies on VQA and Visual Entailment”, Song et al 2022
- “Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision”, Cui et al 2022
- “The Unsurprising Effectiveness of Pre-Trained Vision Models for Control”, Parisi et al 2022
- “Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment”, Zhou et al 2022
- “RuCLIP—New Models and Experiments: A Technical Report”, Shonenkov et al 2022
- “Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework”, Gu et al 2022
- “CLIPasso: Semantically-Aware Object Sketching”, Vinker et al 2022
- “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Li et al 2022
- “Can Wikipedia Help Offline Reinforcement Learning?”, Reid et al 2022
- “SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models”, Singh et al 2022
- “CM3: A Causal Masked Multimodal Model of the Internet”, Aghajanyan et al 2022
- “LSeg: Language-driven Semantic Segmentation”, Li et al 2022
- “Design Guidelines for Prompt Engineering Text-to-Image Generative Models”, Liu & Chilton 2022b
- “Detecting Twenty-thousand Classes Using Image-level Supervision”, Zhou et al 2022
- “A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision”, Tejankar et al 2021
- “High-Resolution Image Synthesis With Latent Diffusion Models”, Rombach et al 2021
- “RegionCLIP: Region-based Language-Image Pretraining”, Zhong et al 2021
- “More Control for Free! Image Synthesis With Semantic Diffusion Guidance”, Liu et al 2021
- “CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions”, Abdal et al 2021
- “MAGMA—Multimodal Augmentation of Generative Models through Adapter-based Finetuning”, Eichenberg et al 2021
- “DenseCLIP: Extract Free Dense Labels from CLIP”, Zhou et al 2021
- “Zero-Shot Text-Guided Object Generation With Dream Fields”, Jain et al 2021
- “FuseDream: Training-Free Text-to-Image Generation With Improved CLIP+GAN Space Optimization”, Liu et al 2021
- “MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Soldan et al 2021
- “CRIS: CLIP-Driven Referring Image Segmentation”, Wang et al 2021
- “Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic”, Tewel et al 2021
- “Blended Diffusion for Text-driven Editing of Natural Images”, Avrahami et al 2021
- “LAFITE: Towards Language-Free Training for Text-to-Image Generation”, Zhou et al 2021
- “Florence: A New Foundation Model for Computer Vision”, Yuan et al 2021
- “BASIC: Combined Scaling for Open-Vocabulary Image Classification”, Pham et al 2021
- “ClipCap: CLIP Prefix for Image Captioning”, Mokady et al 2021
- “Simple but Effective: CLIP Embeddings for Embodied AI”, Khandelwal et al 2021
- “INTERN: A New Learning Paradigm Towards General Vision”, Shao et al 2021
- “LiT: Zero-Shot Transfer With Locked-image Text Tuning”, Zhai et al 2021
- “Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling”, Zhang et al 2021
- “StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis”, Schaldenbrand et al 2021
- “LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs”, Schuhmann et al 2021
- “Projected GANs Converge Faster”, Sauer et al 2021
- “Telling Creative Stories Using Generative Visual Aids”, Ali & Parikh 2021
- “Image-Based CLIP-Guided Essence Transfer”, Chefer et al 2021
- “Wav2CLIP: Learning Robust Audio Representations From CLIP”, Wu et al 2021
- “Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (DeCLIP)”, Li et al 2021
- “CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation”, Sanghi et al 2021
- “MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-training”, You et al 2021
- “OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Wu et al 2021
- “DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models”, Kim & Ye 2021
- “CLOOB: Modern Hopfield Networks With InfoLOOB Outperform CLIP”, Fürst et al 2021
- “VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding”, Xu et al 2021
- “ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, Xie & Zheng 2021
- “CLIPort: What and Where Pathways for Robotic Manipulation”, Shridhar et al 2021
- “THINGSvision: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks”, Muttenthaler & Hebart 2021
- “Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts”, Tian & Ha 2021
- “What Vision-Language Models ‘See’ When They See Scenes”, Cafagna et al 2021
- “EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling”, Wang et al 2021
- “Zero-Shot Open Set Detection by Extending CLIP”, Esmaeilpour et al 2021
- “Robust Fine-tuning of Zero-shot Models”, Wortsman et al 2021
- “What Users Want? WARHOL: A Generative Model for Recommendation”, Samaran et al 2021
- “LAION-400-Million Open Dataset”, Schuhmann 2021
- “Contrastive Language-Image Pre-training for the Italian Language”, Bianchi et al 2021
- “Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications”, Agarwal et al 2021
- “StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators”, Gal et al 2021
- “Language Grounding With 3D Objects”, Thomason et al 2021
- “Segmentation in Style: Unsupervised Semantic Image Segmentation With StyleGAN and CLIP”, Pakhomov et al 2021
- “How Much Can CLIP Benefit Vision-and-Language Tasks?”, Shen et al 2021
- “FairyTailor: A Multimodal Generative Framework for Storytelling”, Bensaid et al 2021
- “CLIP-It! Language-Guided Video Summarization”, Narasimhan et al 2021
- “Small In-distribution Changes in 3D Perspective and Lighting Fool Both CNNs and Transformers”, Madan et al 2021
- “CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”, Frans et al 2021
- “AudioCLIP: Extending CLIP to Image, Text and Audio”, Guzhov et al 2021
- “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Fang et al 2021
- “A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods”, Cheema et al 2021
- “Partial Success in Closing the Gap between Human and Machine Vision”, Geirhos et al 2021
- “ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation”, Zhu et al 2021
- “Exploring the Limits of Out-of-Distribution Detection”, Fort et al 2021
- “Chinese AI Lab Challenges Google, OpenAI With a Model of 1.75 Trillion Parameters”, Du 2021
- “Generative Art Using Neural Visual Grammars and Dual Encoders”, Fernando et al 2021
- “Zero-Shot Detection via Vision and Language Knowledge Distillation”, Gu et al 2021
- “CLIPScore: A Reference-free Evaluation Metric for Image Captioning”, Hessel et al 2021
- “Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Cheng et al 2021
- “Paint by Word”, Bau et al 2021
- “WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training”, Huo et al 2021
- “Multimodal Neurons in Artificial Neural Networks [CLIP]”, Goh et al 2021
- “Zero-Shot Text-to-Image Generation”, Ramesh et al 2021
- “ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”, Jia et al 2021
- “Generating Images from Caption and vice Versa via CLIP-Guided Generative Latent Space Search”, Galatolo et al 2021
- “Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers”, Hendricks et al 2021
- “Scoring Images from TADNE With CLIP”, nagolinc 2021
- “CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021
- “CLIP: Connecting Text and Images: We’re Introducing a Neural Network Called CLIP Which Efficiently Learns Visual Concepts from Natural Language Supervision. CLIP Can Be Applied to Any Visual Classification Benchmark by Simply Providing the Names of the Visual Categories to Be Recognized, Similar to the ‘zero-shot’ Capabilities of GPT-2 and GPT-3”, Radford et al 2021
- “DALL·E 1: Creating Images from Text: We’ve Trained a Neural Network Called DALL·E That Creates Images from Text Captions for a Wide Range of Concepts Expressible in Natural Language”, Ramesh et al 2021
- “Transformers in Vision: A Survey”, Khan et al 2021
- “Vision Transformer: An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale”, Dosovitskiy et al 2020
- “M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training”, Ni et al 2020
- “Learning to Scale Multilingual Representations for Vision-Language Tasks”, Burns et al 2020
- “The Messy, Secretive Reality behind OpenAI’s Bid to save the World: The AI Moonshot Was Founded in the Spirit of Transparency. This Is the inside Story of How Competitive Pressure Eroded That Idealism”, Hao 2020
- “MULE: Multimodal Universal Language Embedding”, Kim et al 2019
- “This Anime Does Not Exist, Search: This Notebook Uses the Precomputed CLIP Feature Vectors for 100k Images from TADNE”
- “CLIPIT PixelDraw Demo”
- “Clustering-laion400m: Script and Models for Clustering LAION-400m CLIP Embeddings. Models Were Fit on the First Million or so Image Embeddings.”
- “The Bouba/Kiki Effect And Sound Symbolism In CLIP”
- “Pixels Still Beat Text: Attacking the OpenAI CLIP Model With Text Patches and Adversarial Pixel Perturbations”
- “[P] List of Sites/programs/projects That Use OpenAI’s CLIP Neural Network for Steering Image/video Creation to Match a Text Description”
- “New AI Tools CLIP+VQ-GAN Can Create Impressive Works of Art Based on Just a Few Words of Input”
- “Apple or IPod? Easy Fix for Adversarial Textual Attacks on OpenAI’s CLIP Model!”
- Miscellaneous
- Link Bibliography
“MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training”, McKinzie et al 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
“Discovering Universal Semantic Triggers for Text-to-Image Synthesis”, Zhai et al 2024
Discovering Universal Semantic Triggers for Text-to-Image Synthesis
“TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones”, Yuan et al 2023
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
“Parrot Captions Teach CLIP to Spot Text”, Lin et al 2023
“StarVector: Generating Scalable Vector Graphics Code from Images”, Rodriguez et al 2023
StarVector: Generating Scalable Vector Graphics Code from Images
“Vision-Language Models As a Source of Rewards”, Baumli et al 2023
“Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding”, Evans et al 2023
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding
“ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations”, Patel et al 2023
ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
“Alpha-CLIP: A CLIP Model Focusing on Wherever You Want”, Sun et al 2023
“Are Vision Transformers More Data Hungry Than Newborn Visual Systems?”, Pandey et al 2023
Are Vision Transformers More Data Hungry Than Newborn Visual Systems?
“BioCLIP: A Vision Foundation Model for the Tree of Life”, Stevens et al 2023
“Rethinking FID: Towards a Better Evaluation Metric for Image Generation”, Jayasumana et al 2023
Rethinking FID: Towards a Better Evaluation Metric for Image Generation
“SatCLIP: Global, General-Purpose Location Embeddings With Satellite Imagery”, Klemmer et al 2023
SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery
“Test-time Adaptation of Discriminative Models via Diffusion Generative Feedback”, Prabhudesai et al 2023
Test-time Adaptation of Discriminative Models via Diffusion Generative Feedback
“One-for-All: Towards Universal Domain Translation With a Single StyleGAN”, Du et al 2023
One-for-All: Towards Universal Domain Translation with a Single StyleGAN
“Does CLIP’s Generalization Performance Mainly Stem from High Train-Test Similarity?”, Mayilvahanan et al 2023
Does CLIP’s Generalization Performance Mainly Stem from High Train-Test Similarity?
“From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions”, Lai et al 2023
From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions
“LLaVA-1.5: Improved Baselines With Visual Instruction Tuning”, Liu et al 2023
LLaVA-1.5: Improved Baselines with Visual Instruction Tuning
“Data Filtering Networks”, Fang et al 2023
“Vision Transformers Need Registers”, Darcet et al 2023
“Demystifying CLIP Data”, Xu et al 2023
“Multimodal Neurons in Pretrained Text-Only Transformers”, Schwettmann et al 2023
“Investigating the Existence of ‘Secret Language’ in Language Models”, Wang et al 2023
Investigating the Existence of ‘Secret Language’ in Language Models
“InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation”, Wang et al 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
“PIGEON: Predicting Image Geolocations”, Haas et al 2023
“CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution”, Freiberger et al 2023
CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution
“SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis”, Podell et al 2023
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
“SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality”, Hsieh et al 2023
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
“Anime Character Identification and Tag Prediction by Multimodality Modeling: Dataset and Model”, Yi et al 2023
Anime Character Identification and Tag Prediction by Multimodality Modeling: Dataset and Model
“ChessGPT: Bridging Policy Learning and Language Modeling”, Feng et al 2023
“Rosetta Neurons: Mining the Common Units in a Model Zoo”, Dravid et al 2023
“Image Captioners Are Scalable Vision Learners Too”, Tschannen et al 2023
“Improving Neural Network Representations Using Human Similarity Judgments”, Muttenthaler et al 2023
Improving neural network representations using human similarity judgments
“Artificial Intelligence and Art: Identifying the Esthetic Judgment Factors That Distinguish Human & Machine-generated Artwork”, Samo & Highhouse 2023
“Generalizable Synthetic Image Detection via Language-guided Contrastive Learning”, Wu et al 2023
Generalizable Synthetic Image Detection via Language-guided Contrastive Learning
“TorToise: Better Speech Synthesis through Scaling”, Betker 2023
“ImageBind: One Embedding Space To Bind Them All”, Girdhar et al 2023
“Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation”, Kirstain et al 2023
Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation
“A Cookbook of Self-Supervised Learning”, Balestriero et al 2023
“DINOv2: Learning Robust Visual Features without Supervision”, Oquab et al 2023
“KD-DLGAN: Data Limited Image Generation via Knowledge Distillation”, Cui et al 2023
KD-DLGAN: Data Limited Image Generation via Knowledge Distillation
“MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks”, Kuo et al 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
“Sigmoid Loss for Language Image Pre-Training”, Zhai et al 2023
“HiCLIP: Contrastive Language-Image Pretraining With Hierarchy-aware Attention”, Geng et al 2023
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention
“When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It?”, Yuksekgonul et al 2023
When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It?
“Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery”, Wen et al 2023
Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery
“BLIP-2: Bootstrapping Language-Image Pre-training With Frozen Image Encoders and Large Language Models”, Li et al 2023
“MUG: Vision Learners Meet Web Image-Text Pairs”, Zhao et al 2023
“Reaching 80% Zero-Shot Accuracy With OpenCLIP: VIT-G/14 Trained On LAION-2B”, Wortsman 2023
Reaching 80% Zero-Shot Accuracy With OpenCLIP: VIT-G/14 Trained On LAION-2B
“Reproducible Scaling Laws for Contrastive Language-image Learning”, Cherti et al 2022
Reproducible scaling laws for contrastive language-image learning
“CLIP Itself Is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy With ViT-B and ViT-L on ImageNet”, Dong et al 2022
“A Whack-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others”, Li et al 2022
A Whack-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others
“Scaling Language-Image Pre-training via Masking”, Li et al 2022
“Videogenic: Video Highlights via Photogenic Moments”, Lin et al 2022
“Retrieval-Augmented Multimodal Language Modeling”, Yasunaga et al 2022
“ClipCrop: Conditioned Cropping Driven by Vision-Language Model”, Zhong et al 2022
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
“I Can’t Believe There’s No Images! Learning Visual Tasks Using Only Language Data”, Gu et al 2022
I Can’t Believe There’s No Images! Learning Visual Tasks Using only Language Data
“MaskDistill: A Unified View of Masked Image Modeling”, Anonymous 2022
“Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces”, Rampas et al 2022
Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces
“AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities”, Chen et al 2022
AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
“EDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”, Balaji et al 2022
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
“Text-Only Training for Image Captioning Using Noise-Injected CLIP”, Nukrai et al 2022
Text-Only Training for Image Captioning using Noise-Injected CLIP
“3DALL·E: Integrating Text-to-Image AI in 3D Design Workflows”, Liu et al 2022
3DALL·E: Integrating Text-to-Image AI in 3D Design Workflows
“Vision-Language Pre-training: Basics, Recent Advances, and Future Trends”, Gan et al 2022
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
“ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training”, Norelli et al 2022
ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training
“Incorporating Natural Language into Vision Models Improves Prediction and Understanding of Higher Visual Cortex”, Wang et al 2022
“Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest”, Hessel et al 2022
“Fast Text2StyleGAN: Text-Free Learning of a Natural Language Interface for Pretrained Face Generators”, Du et al 2022
“What Does a Platypus Look Like? Generating Customized Prompts for Zero-shot Image Classification (CuPL)”, Pratt et al 2022
“Decoding Speech from Non-invasive Brain Recordings”, Défossez et al 2022
“Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP”, Nguyen et al 2022
Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP
“CLIP-based Neural Neighbor Style Transfer for 3D Assets”, Mishra & Granskog 2022
“EVL: Frozen CLIP Models Are Efficient Video Learners”, Lin et al 2022
“X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, Ni et al 2022
X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition
“LaTTe: Language Trajectory TransformEr”, Bucker et al 2022
“Adversarial Attacks on Image Generation With Made-Up Words”, Millière 2022
“TOnICS: Curriculum Learning for Data-Efficient Vision-Language Alignment”, Srinivasan et al 2022
TOnICS: Curriculum Learning for Data-Efficient Vision-Language Alignment
“MS-CLIP: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training”, You et al 2022
MS-CLIP: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
“Text-Guided Synthesis of Artistic Images With Retrieval-Augmented Diffusion Models”, Rombach et al 2022
Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models
“NewsStories: Illustrating Articles With Visual Summaries”, Tan et al 2022
“Semantic Abstraction (SemAbs): Open-World 3D Scene Understanding from 2D Vision-Language Models”, Ha & Song 2022
Semantic Abstraction (SemAbs): Open-World 3D Scene Understanding from 2D Vision-Language Models
“Don’t Stop Learning: Towards Continual Learning for the CLIP Model”, Ding et al 2022
Don’t Stop Learning: Towards Continual Learning for the CLIP Model
“X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Ma et al 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
“Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning”, Santurkar et al 2022
Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning
“LM-Nav: Robotic Navigation With Large Pre-Trained Models of Language, Vision, and Action”, Shah et al 2022
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action
“CLAP: Learning Audio Concepts From Natural Language Supervision”, Elizalde et al 2022
CLAP: Learning Audio Concepts From Natural Language Supervision
“ADAPT: Vision-Language Navigation With Modality-Aligned Action Prompts”, Lin et al 2022
“Improved Vector Quantized Diffusion Models”, Tang et al 2022
“CyCLIP: Cyclic Contrastive Language-Image Pretraining”, Goel et al 2022
“Fine-grained Image Captioning With CLIP Reward”, Cho et al 2022
“VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Wang et al 2022
“AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars”, Hong et al 2022
“CoCa: Contrastive Captioners Are Image-Text Foundation Models”, Yu et al 2022
“Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)”, Fang et al 2022
“Retrieval-Augmented Diffusion Models: Semi-Parametric Neural Image Synthesis”, Blattmann et al 2022
“Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?”, Cui et al 2022
“Opal: Multimodal Image Generation for News Illustration”, Liu et al 2022
“VQGAN-CLIP: Open Domain Image Generation and Editing With Natural Language Guidance”, Crowson et al 2022
“DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Ramesh et al 2022 (page 16 org openai)
“No Token Left Behind: Explainability-Aided Image Classification and Generation”, Paiss et al 2022
“Semantic Exploration from Language Abstractions and Pretrained Representations”, Tam et al 2022
“Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality”, Thrush et al 2022
“Unified Contrastive Learning in Image-Text-Label Space”, Yang et al 2022
“Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Zeng et al 2022
“Learning to Generate Line Drawings That Convey Geometry and Semantics”, Chan et al 2022
“CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Taesiri et al 2022
“CLIP on Wheels (CoW): Zero-Shot Object Navigation As Object Localization and Exploration”, Gadre et al 2022
“Bamboo: Building Mega-Scale Vision Dataset Continually With Human-Machine Synergy”, Zhang et al 2022
“CLIP Models Are Few-shot Learners: Empirical Studies on VQA and Visual Entailment”, Song et al 2022
“Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision”, Cui et al 2022
“The Unsurprising Effectiveness of Pre-Trained Vision Models for Control”, Parisi et al 2022
“Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment”, Zhou et al 2022
“RuCLIP—New Models and Experiments: A Technical Report”, Shonenkov et al 2022
“Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework”, Gu et al 2022
“CLIPasso: Semantically-Aware Object Sketching”, Vinker et al 2022
“BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Li et al 2022
“Can Wikipedia Help Offline Reinforcement Learning?”, Reid et al 2022
“SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models”, Singh et al 2022
“CM3: A Causal Masked Multimodal Model of the Internet”, Aghajanyan et al 2022
“LSeg: Language-driven Semantic Segmentation”, Li et al 2022
“Design Guidelines for Prompt Engineering Text-to-Image Generative Models”, Liu & Chilton 2022b
“Detecting Twenty-thousand Classes Using Image-level Supervision”, Zhou et al 2022
“A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision”, Tejankar et al 2021
“High-Resolution Image Synthesis With Latent Diffusion Models”, Rombach et al 2021
“RegionCLIP: Region-based Language-Image Pretraining”, Zhong et al 2021
“More Control for Free! Image Synthesis With Semantic Diffusion Guidance”, Liu et al 2021
“CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions”, Abdal et al 2021
“MAGMA—Multimodal Augmentation of Generative Models through Adapter-based Finetuning”, Eichenberg et al 2021
“DenseCLIP: Extract Free Dense Labels from CLIP”, Zhou et al 2021
“Zero-Shot Text-Guided Object Generation With Dream Fields”, Jain et al 2021
“FuseDream: Training-Free Text-to-Image Generation With Improved CLIP+GAN Space Optimization”, Liu et al 2021
“MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Soldan et al 2021
“CRIS: CLIP-Driven Referring Image Segmentation”, Wang et al 2021
“Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic”, Tewel et al 2021
“Blended Diffusion for Text-driven Editing of Natural Images”, Avrahami et al 2021
“LAFITE: Towards Language-Free Training for Text-to-Image Generation”, Zhou et al 2021
“Florence: A New Foundation Model for Computer Vision”, Yuan et al 2021
“BASIC: Combined Scaling for Open-Vocabulary Image Classification”, Pham et al 2021
“ClipCap: CLIP Prefix for Image Captioning”, Mokady et al 2021
“Simple but Effective: CLIP Embeddings for Embodied AI”, Khandelwal et al 2021
“INTERN: A New Learning Paradigm Towards General Vision”, Shao et al 2021
“LiT: Zero-Shot Transfer With Locked-image Text Tuning”, Zhai et al 2021
“Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling”, Zhang et al 2021
“StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis”, Schaldenbrand et al 2021
“LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs”, Schuhmann et al 2021
“Projected GANs Converge Faster”, Sauer et al 2021
“Telling Creative Stories Using Generative Visual Aids”, Ali & Parikh 2021
“Image-Based CLIP-Guided Essence Transfer”, Chefer et al 2021
“Wav2CLIP: Learning Robust Audio Representations From CLIP”, Wu et al 2021
“Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (DeCLIP)”, Li et al 2021
“CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation”, Sanghi et al 2021
“MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-training”, You et al 2021
“OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Wu et al 2021
“DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models”, Kim & Ye 2021
“CLOOB: Modern Hopfield Networks With InfoLOOB Outperform CLIP”, Fürst et al 2021
“VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding”, Xu et al 2021
“ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, Xie & Zheng 2021
“CLIPort: What and Where Pathways for Robotic Manipulation”, Shridhar et al 2021
“THINGSvision: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks”, Muttenthaler & Hebart 2021
“Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts”, Tian & Ha 2021
“What Vision-Language Models ‘See’ When They See Scenes”, Cafagna et al 2021
“EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling”, Wang et al 2021
“Zero-Shot Open Set Detection by Extending CLIP”, Esmaeilpour et al 2021
“Robust Fine-tuning of Zero-shot Models”, Wortsman et al 2021
“What Users Want? WARHOL: A Generative Model for Recommendation”, Samaran et al 2021
“LAION-400-Million Open Dataset”, Schuhmann 2021
“Contrastive Language-Image Pre-training for the Italian Language”, Bianchi et al 2021
“Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications”, Agarwal et al 2021
“StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators”, Gal et al 2021
“Language Grounding With 3D Objects”, Thomason et al 2021
“Segmentation in Style: Unsupervised Semantic Image Segmentation With StyleGAN and CLIP”, Pakhomov et al 2021
“How Much Can CLIP Benefit Vision-and-Language Tasks?”, Shen et al 2021
“FairyTailor: A Multimodal Generative Framework for Storytelling”, Bensaid et al 2021
“CLIP-It! Language-Guided Video Summarization”, Narasimhan et al 2021
“Small In-distribution Changes in 3D Perspective and Lighting Fool Both CNNs and Transformers”, Madan et al 2021
“CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”, Frans et al 2021
“AudioCLIP: Extending CLIP to Image, Text and Audio”, Guzhov et al 2021
“CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Fang et al 2021
“A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods”, Cheema et al 2021
“Partial Success in Closing the Gap between Human and Machine Vision”, Geirhos et al 2021
“ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation”, Zhu et al 2021
“Exploring the Limits of Out-of-Distribution Detection”, Fort et al 2021
“Chinese AI Lab Challenges Google, OpenAI With a Model of 1.75 Trillion Parameters”, Du 2021
“Generative Art Using Neural Visual Grammars and Dual Encoders”, Fernando et al 2021
“Zero-Shot Detection via Vision and Language Knowledge Distillation”, Gu et al 2021
“CLIPScore: A Reference-free Evaluation Metric for Image Captioning”, Hessel et al 2021
“Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Cheng et al 2021
“Paint by Word”, Bau et al 2021
“WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training”, Huo et al 2021
“Multimodal Neurons in Artificial Neural Networks [CLIP]”, Goh et al 2021
“Zero-Shot Text-to-Image Generation”, Ramesh et al 2021
“ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”, Jia et al 2021
“Generating Images from Caption and vice Versa via CLIP-Guided Generative Latent Space Search”, Galatolo et al 2021
“Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers”, Hendricks et al 2021
“Scoring Images from TADNE With CLIP”, nagolinc 2021
“CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021
“CLIP: Connecting Text and Images: We’re Introducing a Neural Network Called CLIP Which Efficiently Learns Visual Concepts from Natural Language Supervision. CLIP Can Be Applied to Any Visual Classification Benchmark by Simply Providing the Names of the Visual Categories to Be Recognized, Similar to the ‘zero-shot’ Capabilities of GPT-2 and GPT-3”, Radford et al 2021
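The CLIP announcement above describes the zero-shot mechanism: embed the image, embed the text of each candidate category name, and pick the best match. As a minimal sketch of that scoring step (using made-up toy 3-dimensional embeddings in place of CLIP’s real encoder outputs, and an arbitrary temperature rather than CLIP’s learned one), classification reduces to a softmax over scaled cosine similarities:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_probs(image_emb, text_embs, temperature=100.0):
    """CLIP-style zero-shot classification: softmax over scaled cosine similarities
    between one image embedding and one text embedding per candidate class."""
    logits = [temperature * cosine(image_emb, t) for t in text_embs]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical embeddings standing in for encoder outputs:
image = [0.9, 0.1, 0.0]                  # "a photo of a dog", roughly
prompts = ["a photo of a dog", "a photo of a cat"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = zero_shot_probs(image, text_embs)
```

With these toy vectors the dog prompt dominates the softmax; in real CLIP the embeddings come from the image and text encoders and the temperature is a learned parameter.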
“DALL·E 1: Creating Images from Text: We’ve Trained a Neural Network Called DALL·E That Creates Images from Text Captions for a Wide Range of Concepts Expressible in Natural Language”, Ramesh et al 2021
“Transformers in Vision: A Survey”, Khan et al 2021
“Vision Transformer: An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale”, Dosovitskiy et al 2020
“M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training”, Ni et al 2020
“Learning to Scale Multilingual Representations for Vision-Language Tasks”, Burns et al 2020
“The Messy, Secretive Reality behind OpenAI’s Bid to save the World: The AI Moonshot Was Founded in the Spirit of Transparency. This Is the inside Story of How Competitive Pressure Eroded That Idealism”, Hao 2020
“MULE: Multimodal Universal Language Embedding”, Kim et al 2019
“This Anime Does Not Exist, Search: This Notebook Uses the Precomputed CLIP Feature Vectors for 100k Images from TADNE”
“CLIPIT PixelDraw Demo”
“Clustering-laion400m: Script and Models for Clustering LAION-400m CLIP Embeddings. Models Were Fit on the First Million or so Image Embeddings.”
“The Bouba/Kiki Effect And Sound Symbolism In CLIP”
“Pixels Still Beat Text: Attacking the OpenAI CLIP Model With Text Patches and Adversarial Pixel Perturbations”
“[P] List of Sites/programs/projects That Use OpenAI’s CLIP Neural Network for Steering Image/video Creation to Match a Text Description”
“New AI Tools CLIP+VQ-GAN Can Create Impressive Works of Art Based on Just a Few Words of Input”
“Apple or iPod? Easy Fix for Adversarial Textual Attacks on OpenAI's CLIP Model!”
Miscellaneous
- /doc/ai/nn/transformer/clip/2023-01-01-ross-openclipscalingforclipvitbigg14laion2b39bb160k.png
- /doc/ai/nn/transformer/clip/2022-04-04-rombach-compvistxt2imgpreview.png
- /doc/ai/nn/transformer/clip/2022-cherti-figure1b-openclipcomputezeroshotretrievalscalingcurve.png
- /doc/ai/nn/transformer/clip/2022-singh-figure3-scalingmodelanddatasetsizes.png
- /doc/ai/nn/transformer/clip/2021-04-22-rivershavewings-clipvqgan-theshadowyhackergroupeleuther.png
- /doc/ai/nn/gan/stylegan/anime/2021-01-20-nagolinc-tadne-clipbasedgeneration-agirlwithapinkhat.png
- /doc/ai/nn/transformer/clip/2021-radford-clip-figure13-cliprobustness.png
- /doc/ai/nn/transformer/clip/2021-radford-clip-figure21-zeroshot36differenttasks.png
- /doc/ai/nn/transformer/clip/2021-radford-clip-figure4-promptengineering.png
- /doc/ai/nn/transformer/clip/2021-radford-clip-figure5-clipzeroshotvsfullresnet.png
- /doc/ai/nn/transformer/clip/2021-radford-clip-figure9-clipcomputescaling.png
- https://colab.research.google.com/drive/189LHTpYaefMhKNIGOzTLHHavlgmoIWg9
- https://colab.research.google.com/drive/1N8Cc9yYzNR4M9J2NtE3n3jL2Jy25KAl_
- https://colab.research.google.com/drive/1c6VccMPsOMAUQCKU4BVDRd5Y32qkozmK
- https://colab.research.google.com/github/kvfrans/clipdraw/blob/main/clipdraw.ipynb
- https://creator.nightcafe.studio/vqgan-clip-keyword-modifier-comparison
- https://github.com/EleutherAI/vqgan-clip/tree/main/notebooks
- https://github.com/LAION-AI/laion-datasets/blob/main/laion-aesthetic.md
- https://github.com/christophschuhmann/4MC-4M-Image-Text-Pairs-with-CLIP-embeddings
- https://github.com/openai/CLIP/blob/main/data/yfcc100m.md
- https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup
- https://jxmo.notion.site/The-Weird-and-Wonderful-World-of-AI-Art-b9615a2e7278435b98380ff81ae1cf09
- https://stanislavfort.com/2021/01/12/OpenAI_CLIP_adversarial_examples.html
- https://stanislavfort.github.io/blog/OpenAI_CLIP_adversarial_examples/
- https://stanislavfort.github.io/blog/OpenAI_CLIP_stickers_and_adversarial_examples/
- https://tech.pic-collage.com/distillation-of-clip-model-and-other-experiments-f8394b7321ce
- https://twitter.com/NicholasBardy/status/1530461357048418304
- https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA
- https://web.media.mit.edu/~echu/assets/projects/evolving-views/paper.pdf
- https://www.lesswrong.com/posts/cqGEQeLNbcptYsifz/this-week-in-fashion
- https://www.reddit.com/r/MachineLearning/comments/nq4es7/d_unreal_engine_trick_with_vqgan_clip/
- https://www.unlimiteddreamco.xyz/articles/writing-good-prompts-part-1/
- https://www.unlimiteddreamco.xyz/articles/writing-good-prompts-part-2/
- https://www.unlimiteddreamco.xyz/articles/writing-good-prompts-part-3/
Link Bibliography
- https://arxiv.org/abs/2312.16862: “TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones”, Zhengqing Yuan, Zhaoxu Li, Lichao Sun
- https://arxiv.org/abs/2312.11556: “StarVector: Generating Scalable Vector Graphics Code from Images”, Juan A. Rodriguez, Shubham Agarwal, Issam H. Laradji, Pau Rodriguez, David Vazquez, Christopher Pal, Marco Pedersoli
- https://arxiv.org/abs/2312.05328#deepmind: “Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding”, Talfan Evans, Shreya Pathak, Hamza Merzic, Jonathan Schwarz, Ryutaro Tanno, Olivier J. Henaff
- https://arxiv.org/abs/2309.17425#apple: “Data Filtering Networks”, Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, Vaishaal Shankar
- https://arxiv.org/abs/2309.16671: “Demystifying CLIP Data”
- https://arxiv.org/abs/2307.01952#stability: “SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis”, Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach
- 2023-yi.pdf: “Anime Character Identification and Tag Prediction by Multimodality Modeling: Dataset and Model”, Fan Yi, Jiaxiang Wu, Minyi Zhao, Shuigeng Zhou
- https://arxiv.org/abs/2306.09346: “Rosetta Neurons: Mining the Common Units in a Model Zoo”, Amil Dravid, Yossi Gandelsman, Alexei A. Efros, Assaf Shocher
- 2023-samo.pdf: “Artificial Intelligence and Art: Identifying the Esthetic Judgment Factors That Distinguish Human & Machine-generated Artwork”, Andrew Samo, Scott Highhouse
- https://arxiv.org/abs/2305.05665#facebook: “ImageBind: One Embedding Space To Bind Them All”, Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
- https://arxiv.org/abs/2305.01569: “Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation”, Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, Omer Levy
- https://arxiv.org/abs/2304.07193#facebook: “DINOv2: Learning Robust Visual Features without Supervision”
- https://arxiv.org/abs/2303.15343#google: “Sigmoid Loss for Language Image Pre-Training”, Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer
- https://openreview.net/forum?id=KRLUvxh8uaX: “When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It?”, Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou
- https://arxiv.org/abs/2301.12597#salesforce: “BLIP-2: Bootstrapping Language-Image Pre-training With Frozen Image Encoders and Large Language Models”, Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
- https://arxiv.org/abs/2301.07088#bytedance: “MUG: Vision Learners Meet Web Image-Text Pairs”, Bingchen Zhao, Quan Cui, Hao Wu, Osamu Yoshie, Cheng Yang
- https://laion.ai/blog/giant-openclip/: “Reaching 80% Zero-Shot Accuracy With OpenCLIP: ViT-G/14 Trained On LAION-2B”, Mitchell Wortsman
- https://arxiv.org/abs/2212.07143: “Reproducible Scaling Laws for Contrastive Language-image Learning”
- https://arxiv.org/abs/2212.06138#microsoft: “CLIP Itself Is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy With ViT-B and ViT-L on ImageNet”
- https://arxiv.org/abs/2211.12561#facebook: “Retrieval-Augmented Multimodal Language Modeling”
- https://openreview.net/forum?id=wmGlMhaBe0: “MaskDistill: A Unified View of Masked Image Modeling”, Anonymous
- https://arxiv.org/abs/2211.07292: “Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces”, Dominic Rampas, Pablo Pernias, Elea Zhong, Marc Aubreville
- https://arxiv.org/abs/2211.06679#baai: “AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities”, Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu
- https://arxiv.org/abs/2211.01324#nvidia: “eDiff-I: Text-to-Image Diffusion Models With an Ensemble of Expert Denoisers”
- https://www.biorxiv.org/content/10.1101/2022.09.27.508760.full: “Incorporating Natural Language into Vision Models Improves Prediction and Understanding of Higher Visual Cortex”, Aria Y. Wang, Kendrick Kay, Thomas Naselaris, Michael J. Tarr, Leila Wehbe
- https://arxiv.org/abs/2209.03953: “Fast Text2StyleGAN: Text-Free Learning of a Natural Language Interface for Pretrained Face Generators”, Xiaodan Du, Raymond A. Yeh, Nicholas Kolkin, Eli Shechtman, Greg Shakhnarovich
- https://arxiv.org/abs/2209.03320: “What Does a Platypus Look Like? Generating Customized Prompts for Zero-shot Image Classification (CuPL)”, Sarah Pratt, Rosanne Liu, Ali Farhadi
- https://arxiv.org/abs/2208.12266#facebook: “Decoding Speech from Non-invasive Brain Recordings”, Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, Jean-Rémi King
- https://arxiv.org/abs/2208.05516: “Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP”, Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, Ludwig Schmidt
- https://arxiv.org/abs/2208.03550: “EVL: Frozen CLIP Models Are Efficient Video Learners”, Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
- https://arxiv.org/abs/2207.14525: “TOnICS: Curriculum Learning for Data-Efficient Vision-Language Alignment”, Tejas Srinivasan, Xiang Ren, Jesse Thomason
- https://arxiv.org/abs/2207.12661#microsoft: “MS-CLIP: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training”, Haoxuan You, Luowei Zhou, Bin Xiao, Noel Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan
- https://arxiv.org/abs/2207.13061: “NewsStories: Illustrating Articles With Visual Summaries”, Reuben Tan, Bryan A. Plummer, Kate Saenko, J. P. Lewis, Avneesh Sud, Thomas Leung
- https://arxiv.org/abs/2207.07285#alibaba: “X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval”, Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, Rongrong Ji
- https://arxiv.org/abs/2207.07635: “Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning”, Shibani Santurkar, Yann Dubois, Rohan Taori, Percy Liang, Tatsunori Hashimoto
- https://arxiv.org/abs/2207.04429: “LM-Nav: Robotic Navigation With Large Pre-Trained Models of Language, Vision, and Action”, Dhruv Shah, Blazej Osinski, Brian Ichter, Sergey Levine
- https://arxiv.org/abs/2205.16007#microsoft: “Improved Vector Quantized Diffusion Models”, Zhicong Tang, Shuyang Gu, Jianmin Bao, Dong Chen, Fang Wen
- https://arxiv.org/abs/2205.14459: “CyCLIP: Cyclic Contrastive Language-Image Pretraining”, Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan A. Rossi, Vishwa Vinay, Aditya Grover
- https://arxiv.org/abs/2205.10747: “VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”
- https://arxiv.org/abs/2205.08535: “AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars”, Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, Ziwei Liu
- https://arxiv.org/abs/2205.01917#google: “CoCa: Contrastive Captioners Are Image-Text Foundation Models”, Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
- https://arxiv.org/abs/2205.01397: “Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)”, Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, Ludwig Schmidt
- https://arxiv.org/pdf/2204.06125.pdf#page=16&org=openai: “DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
- https://arxiv.org/abs/2204.05080#deepmind: “Semantic Exploration from Language Abstractions and Pretrained Representations”
- https://arxiv.org/abs/2204.03610#microsoft: “Unified Contrastive Learning in Image-Text-Label Space”, Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, Jianfeng Gao
- https://arxiv.org/abs/2204.00598#google: “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”
- https://arxiv.org/abs/2203.11096: “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-shot Transfer Learning”, Mohammad Reza Taesiri, Finlay Macklon, Cor-Paul Bezemer
- https://arxiv.org/abs/2202.06767#huawei: “Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework”
- https://arxiv.org/abs/2201.12086#salesforce: “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation”, Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
- https://arxiv.org/abs/2201.08371#facebook: “SWAG: Revisiting Weakly Supervised Pre-Training of Visual Perception Models”
- 2022-liu-2.pdf: “Design Guidelines for Prompt Engineering Text-to-Image Generative Models”, Vivian Liu, Lydia B. Chilton
- https://arxiv.org/abs/2201.02605#facebook: “Detecting Twenty-thousand Classes Using Image-level Supervision”, Xingyi Zhou, Rohit Girdhar, Armand Joulin, Phillip Krähenbühl, Ishan Misra
- https://arxiv.org/abs/2112.10752: “High-Resolution Image Synthesis With Latent Diffusion Models”, Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer
- https://arxiv.org/abs/2112.09106#microsoft: “RegionCLIP: Region-based Language-Image Pretraining”
- https://arxiv.org/abs/2112.05744: “More Control for Free! Image Synthesis With Semantic Diffusion Guidance”
- https://arxiv.org/abs/2112.01071: “DenseCLIP: Extract Free Dense Labels from CLIP”, Chong Zhou, Chen Change Loy, Bo Dai
- https://arxiv.org/abs/2112.01573: “FuseDream: Training-Free Text-to-Image Generation With Improved CLIP+GAN Space Optimization”, Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, Qiang Liu
- https://arxiv.org/abs/2111.11432#microsoft: “Florence: A New Foundation Model for Computer Vision”
- https://arxiv.org/abs/2111.10050#google: “BASIC: Combined Scaling for Open-Vocabulary Image Classification”
- https://arxiv.org/abs/2111.09734: “ClipCap: CLIP Prefix for Image Captioning”, Ron Mokady, Amir Hertz, Amit H. Bermano
- https://arxiv.org/abs/2111.07991#google: “LiT: Zero-Shot Transfer With Locked-image Text Tuning”, Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer
- https://arxiv.org/abs/2111.03930: “Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling”, Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li
- https://arxiv.org/abs/2111.03133: “StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis”, Peter Schaldenbrand, Zhixuan Liu, Jean Oh
- https://arxiv.org/abs/2111.02114#laion: “LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs”
- https://arxiv.org/abs/2111.01007: “Projected GANs Converge Faster”, Axel Sauer, Kashyap Chitta, Jens Müller, Andreas Geiger
- https://arxiv.org/abs/2110.05208: “Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm (DeCLIP)”, Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, Junjie Yan
- https://openreview.net/forum?id=ROteIE-4A6W: “MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-training”, Haoxuan You, Luowei Zhou, Bin Xiao, Noel C. Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan
- https://openreview.net/forum?id=G89-1yZLFHk: “OTTER: Data Efficient Language-Supervised Zero-Shot Recognition With Optimal Transport Distillation”, Bichen Wu, Ruizhe Cheng, Peizhao Zhang, Peter Vajda, Joseph E. Gonzalez
- https://openreview.net/forum?id=qw674L9PfQE: “CLOOB: Modern Hopfield Networks With InfoLOOB Outperform CLIP”
- https://arxiv.org/abs/2109.12066: “ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation”, Johnathan Xie, Shuai Zheng
- https://www.frontiersin.org/articles/10.3389/fninf.2021.679838/full: “THINGSvision: A Python Toolbox for Streamlining the Extraction of Activations From Deep Neural Networks”, Lukas Muttenthaler, Martin N. Hebart
- https://arxiv.org/abs/2109.08857#google: “Modern Evolution Strategies for Creativity: Fitting Concrete Images and Abstract Concepts”, Yingtao Tian, David Ha
- https://laion.ai/blog/laion-400-open-dataset/: “LAION-400-Million Open Dataset”, Christoph Schuhmann
- https://arxiv.org/abs/2107.12518: “Segmentation in Style: Unsupervised Semantic Image Segmentation With StyleGAN and CLIP”, Daniil Pakhomov, Sanchit Hira, Narayani Wagle, Kemar E. Green, Nassir Navab
- https://arxiv.org/abs/2107.06383: “How Much Can CLIP Benefit Vision-and-Language Tasks?”, Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer
- https://arxiv.org/abs/2106.16198: “Small In-distribution Changes in 3D Perspective and Lighting Fool Both CNNs and Transformers”, Spandan Madan, Tomotake Sasaki, Tzu-Mao Li, Xavier Boix, Hanspeter Pfister
- https://arxiv.org/abs/2106.13043: “AudioCLIP: Extending CLIP to Image, Text and Audio”, Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel
- https://arxiv.org/abs/2106.11097: “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen
- https://arxiv.org/abs/2106.07411: “Partial Success in Closing the Gap between Human and Machine Vision”
- https://arxiv.org/abs/2106.03004#google: “Exploring the Limits of Out-of-Distribution Detection”, Stanislav Fort, Jie Ren, Balaji Lakshminarayanan
- https://en.pingwest.com/a/8693#baai: “Chinese AI Lab Challenges Google, OpenAI With a Model of 1.75 Trillion Parameters”, Chen Du
- https://arxiv.org/abs/2104.13921#google: “Zero-Shot Detection via Vision and Language Knowledge Distillation”, Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui
- https://arxiv.org/abs/2104.08945#facebook: “Data-Efficient Language-Supervised Zero-Shot Learning With Self-Distillation”, Ruizhe Cheng, Bichen Wu, Peizhao Zhang, Peter Vajda, Joseph E. Gonzalez
- https://distill.pub/2021/multimodal-neurons/#openai: “Multimodal Neurons in Artificial Neural Networks [CLIP]”
- https://arxiv.org/abs/2102.05918#google
: “ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”, -
https://arxiv.org/abs/2102.01645
: “Generating Images from Caption and Vice Versa via CLIP-Guided Generative Latent Space Search”, Federico A. Galatolo, Mario G. C. A. Cimino, Gigliola Vaglini -
https://github.com/nagolinc/notebooks/blob/main/TADNE_and_CLIP.ipynb
: “Scoring Images from TADNE With CLIP”, nagolinc -
https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf
: “CLIP: Learning Transferable Visual Models From Natural Language Supervision”, -
https://openai.com/research/clip
: “CLIP: Connecting Text and Images: We’re Introducing a Neural Network Called CLIP Which Efficiently Learns Visual Concepts from Natural Language Supervision. CLIP Can Be Applied to Any Visual Classification Benchmark by Simply Providing the Names of the Visual Categories to Be Recognized, Similar to the ‘zero-shot’ Capabilities of GPT-2 and GPT-3”, Alec Radford, Ilya Sutskever, Jong Wook Kim, Gretchen Krueger, Sandhini Agarwal -
https://openai.com/research/dall-e
: “DALL·E 1: Creating Images from Text: We’ve Trained a Neural Network Called DALL·E That Creates Images from Text Captions for a Wide Range of Concepts Expressible in Natural Language”, -
https://arxiv.org/abs/2010.11929#google
: “Vision Transformer: An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale”, -
https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/
: “The Messy, Secretive Reality behind OpenAI’s Bid to Save the World: The AI Moonshot Was Founded in the Spirit of Transparency. This Is the Inside Story of How Competitive Pressure Eroded That Idealism”, Karen Hao