October 18, 2019
Written by
I. Zeki Yalniz, Hervé Jégou, Dhruv Mahajan
Accurate image and video classification is important for a wide range of computer vision applications, from identifying harmful content, to making products more accessible to the visually impaired, to helping people more easily buy and sell things on products like Marketplace. Facebook AI is developing alternative ways to train our AI systems so that we can do more with less labeled training data overall, and also deliver accurate results even when large, high-quality labeled data sets are simply not available. Today, we are sharing details on a versatile new model training technique that delivers state-of-the-art accuracy for image and video classification systems.
This approach, which we call semi-weak supervision, is a new way to combine the merits of two different training methods: semi-supervised learning and weakly supervised learning. It opens the door to creating more accurate, efficient production classification models by using a teacher-student model training paradigm and billion-scale weakly supervised data sets. If weakly supervised data sets (such as the hashtags associated with publicly available photos) are not available for the target classification task, our method can also make use of unlabeled data sets to produce highly accurate semi-supervised models.
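One core step in a teacher-student setup like this is using the teacher's predictions over unlabeled data to build a pseudo-labeled training set for the student. The sketch below is a minimal, hypothetical illustration of that idea, not our production pipeline: it assumes a teacher has already scored each unlabeled example against each class, and keeps the top-k highest-scoring examples per class (an example may be kept under more than one class).

```python
from collections import defaultdict


def select_pseudo_labeled(teacher_scores, k):
    """Build a pseudo-labeled set for student training.

    teacher_scores: list of (example_id, {class_name: score}) pairs,
    assumed to come from running a teacher model over unlabeled data.
    Returns {class_name: [example_id, ...]} with the top-k
    highest-scoring examples per class.
    """
    per_class = defaultdict(list)
    for example_id, scores in teacher_scores.items() if isinstance(teacher_scores, dict) else teacher_scores:
        for cls, score in scores.items():
            per_class[cls].append((score, example_id))
    # Rank each class's candidates by teacher confidence and truncate to k.
    return {
        cls: [eid for _, eid in sorted(pairs, reverse=True)[:k]]
        for cls, pairs in per_class.items()
    }


pseudo = select_pseudo_labeled(
    [
        ("img1", {"cat": 0.9, "dog": 0.1}),
        ("img2", {"cat": 0.4, "dog": 0.8}),
        ("img3", {"cat": 0.7, "dog": 0.6}),
    ],
    k=2,
)
print(pseudo["cat"])  # ['img1', 'img3']
```

The student model would then be pretrained on this pseudo-labeled set before fine-tuning on the original labeled data; ranking per class rather than thresholding globally keeps rare classes represented.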
Our semi-weakly supervised training framework has let us set a new state of the art on academic benchmarks for lightweight image and video classification models. We achieved 81.2 percent top-1 accuracy on ImageNet using the ResNet-50 model for our benchmarking tests. On the Kinetics video action classification benchmark, we achieved 74.2 percent top-1 accuracy on the validation set with a low-capacity R(2+1)D-18 model. This is a 2.7 percent improvement over the previous state-of-the-art result, obtained by a weakly supervised R(2+1)D-18 model of the same capacity using the same input data sets and compute resources.
Semi-weakly supervised learning helps reduce the accuracy gap between the high-capacity state-of-the-art models and the computationally efficient production-grade models. Our approach is enabling Facebook to create efficient, low-capacity production-ready models that deliver substantially higher accuracy than was previously possible, which will improve products used by billions of people.
When a target classification model is trained using only labeled data, its accuracy depends heavily on the scale and quality of the data set. But the human labeling of training data required for this fully supervised approach cannot scale to all the possible visual concepts in the world. Labeling thousands of species of plants and animals, for example, is resource intensive and requires extensive domain expertise.
In 2018, Facebook AI researchers demonstrated that we could use the hashtags associated with billions of publicly available Instagram photos to train highly accurate classification models. This approach identifies a set of related hashtags for the target classification task, uses associated images for pretraining, and then fine-tunes the target model with all the available labeled examples. This is a weakly supervised approach, since hashtagged data sets contain significant label noise — for example, tags such as “love” are used subjectively and idiosyncratically, and tags like “perseverance” refer to abstract concepts. But despite these challenges, we were able to train very large capacity weakly supervised models that delivered state-of-the-art accuracy. We open-sourced the classification models that produced these results on various benchmarks.
Although weak supervision has delivered noteworthy successes for well-known academic benchmarks, it has limitations. Hashtagged content is not always available for a particular classification task. On Facebook and Instagram, for example, a large amount of visual content doesn’t have any associated hashtag. And while publicly available unlabeled photos are extremely plentiful, weak supervision cannot use this data for pretraining models. Furthermore, the state-of-the-art weakly supervised classification models are of high capacity and computationally quite expensive. These constraints prompted our exploration of ways to make use of the immense amount of publicly available unlabeled data sets for building more accurate classification models.