A common refrain for computer vision researchers is that modern deep neural networks are always hungry for more labeled data: current state-of-the-art CNNs need to be trained on datasets such as OpenImages or Places, which consist of over 1M labeled images. However, for many applications, collecting this amount of labeled data can be prohibitive for the average practitioner.
A common approach to mitigate the lack of labeled data for computer vision tasks is to use models that have been pre-trained on generic data (e.g., ImageNet). The idea is that visual features learned on the generic data can be re-used for the task of interest. Even though this pre-training works reasonably well in practice, it still falls short of the ability to both quickly grasp new concepts and understand them in different contexts. In a similar spirit to how BERT and T5 have shown advances in the language domain, we believe that large-scale pre-training can advance the performance of computer vision models.
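The pre-train-then-adapt recipe above can be sketched with a generic Keras backbone. This is a minimal illustration, not the BiT method itself: the ResNet-50 architecture, the 10-class head, and the frozen-backbone choice are all assumptions, and `weights=None` is used only so the sketch runs offline (in practice one would load pre-trained weights, e.g. `weights="imagenet"` or a BiT checkpoint).

```python
import tensorflow as tf

NUM_CLASSES = 10  # assumed number of classes in the downstream task

# Generic backbone; weights=None keeps this sketch offline.
# In practice, load pre-trained weights so the learned features transfer.
backbone = tf.keras.applications.ResNet50(
    weights=None,        # substitute "imagenet" (or a BiT checkpoint) in practice
    include_top=False,   # drop the original classification head
    pooling="avg",
    input_shape=(224, 224, 3),
)
backbone.trainable = False  # freeze generic features; train only the new head

# Attach a fresh task-specific head and compile for fine-tuning.
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(NUM_CLASSES),
])
model.compile(
    optimizer="sgd",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```

With the backbone frozen, only the small head is trained on the target task, which is what makes this recipe viable with limited labeled data.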
In “Big Transfer (BiT): General Visual Representation Learning” we devise an approach for effective pre-training of general features using image datasets at a scale beyond the de-facto standard (ILSVRC-2012). In particular, we highlight the importance of appropriately choosing normalization layers and scaling the architecture capacity as the amount of pre-training data increases. Our approach exhibits unprecedented performance when adapting to a wide range of new visual tasks, including the few-shot recognition setting and the recently introduced “real-world” ObjectNet benchmark. We are excited to share the best BiT models pre-trained on public datasets, along with code in TF2, Jax, and PyTorch. This will allow anyone to reach state-of-the-art performance on their task of interest, even with just a handful of labeled images per class.