Google Brain Uncovers Representation Structure Differences Between CNNs and Vision Transformers
A Google Brain research team explores the internal representation structures of ViTs and CNNs on image classification tasks, providing insights on key differences between the two approaches.
Although convolutional neural networks (CNNs) have dominated the field of computer vision for years, new vision transformer models (ViTs) have also shown remarkable abilities, matching and in some cases surpassing CNN performance on many computer vision tasks. The success of ViTs has raised a number of questions: How are ViTs solving these image-based tasks? Do they act like convolutions, learning the same inductive biases from scratch? Or are they developing novel task representations? And what role does scale play in learning these representations?
To find the answers, a new Google Brain paper explores the internal representation structures of ViTs and CNNs on image classification tasks, providing some surprising insights on the differences between ViTs and CNNs.
The team summarizes their main contributions as:
- We investigate the internal representation structure of ViTs and CNNs, finding striking differences between the two models, such as ViTs having more uniform representations, with greater similarity between lower and higher layers.
- Analyzing how local/global spatial information is utilized, we find ViTs incorporate more global information than ResNet at lower layers, leading to quantitatively different features.
- Nevertheless, we find that incorporating local information at lower layers remains vital, with large-scale pretraining data helping early attention layers learn to do this.
- We study the uniform internal structure of ViTs, finding that skip connections in ViTs are even more influential than in ResNets, and have strong effects on performance and representation similarity.
- Motivated by potential future uses in object detection, we examine how well input spatial information is preserved, finding connections between spatial localization and methods of classification.
- We study the effects of dataset scale on transfer learning, with a linear probes study revealing its importance for high-quality intermediate representations.
Analyzing the layer representations of a neural network is challenging, as features are distributed across a large number of neurons. This distribution makes it even more challenging to compare representations across different neural networks. To address these issues, previous studies have proposed centred kernel alignment (CKA) to enable quantitative comparisons of representations within and across networks.
In this work, the Google Brain team uses the CKA approach to study the internal representation structure of each model, taking every pair of layers within a model and computing their CKA similarity. The results show a clear difference between the two model families: ViT representations are more uniform across the network, with greater similarity between lower and higher layers than in the CNNs. The researchers also conduct cross-model comparisons of all ViT and ResNet layers.
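The pairwise layer comparison described above can be sketched in a few lines. The example below is a minimal illustration of plain linear CKA, assuming each layer's activations are given as a list of rows (one row per input example, one column per neuron). It is not the paper's implementation, whose exact estimator and batching may differ, and the function names `center_columns`, `cross_frobenius_sq`, `linear_cka` and `cka_matrix` are hypothetical. It uses only the Python standard library for clarity; real activation matrices would normally be handled with an array library.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b, summed over all column pairs."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices over the same examples.

    x and y must have the same number of rows (examples); the number of
    columns (neurons) may differ. Assumes neither matrix is constant,
    otherwise the denominator is zero.
    """
    x, y = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(x, y)
    denominator = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return numerator / denominator


def cka_matrix(layers):
    """CKA similarity for every pair of layers in a model."""
    return [[linear_cka(a, b) for b in layers] for a in layers]
```

Passing a list of per-layer activation matrices to `cka_matrix` yields the kind of layer-by-layer similarity grid the article refers to. Linear CKA is unaffected by uniformly rescaling a layer or permuting its neurons, which is part of what makes it usable across networks of different widths.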
The study reveals that ViT lower layers compute representations differently than ResNet lower layers; ViTs more strongly propagate representations between lower and higher layers; and, compared to ResNet, the highest ViT layers produce very different visual representations.
The team then explores the local and global information in layer representations, finding that using local information early on for image tasks (as hardcoded into CNN architectures) is important for strong performance. They also find that access to more global information leads to quantitatively different features than those computed by local receptive fields in the lower layers of the ResNet. The study further shows that lower-layer effective receptive fields in ViTs are larger than those in ResNets, and that ViT receptive fields become much more global midway through the network, while ResNet effective receptive fields grow gradually.
The study also shows that skip connections play a more influential role in propagating representations in ViTs than in ResNets, and can have a strong effect on performance and representation similarity. Regarding the spatial information and localization properties of the two approaches, the researchers discover that ViTs with CLS tokens strongly preserve spatial information, suggesting they hold promise for future use in object detection.
Finally, tests on the effects of scale in transfer learning reveal that larger ViT models develop significantly stronger intermediate representations when pretrained on larger datasets.
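The linear probes mentioned in the contributions list measure the quality of intermediate representations by fitting a linear classifier on frozen features from a given layer. The sketch below illustrates the general idea only: it assumes a binary task, toy feature vectors and plain gradient descent, none of which come from the paper, and the names `train_linear_probe` and `probe_accuracy` are hypothetical.

```python
import math


def train_linear_probe(features, labels, lr=0.5, epochs=500):
    """Fit binary logistic regression on frozen features by gradient descent.

    features: list of feature vectors taken from one layer (never updated).
    labels:   list of 0/1 class labels, one per feature vector.
    Returns the learned weights and bias of the linear classifier.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Average the gradients over the dataset and take one step.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples the linear probe classifies correctly."""
    correct = 0
    for x, y in zip(features, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)
```

Running the same probe on features from different layers, or from models pretrained on different amounts of data, gives a simple comparison: the higher the probe accuracy, the more linearly accessible the class information is at that layer.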
Overall, the paper provides many valuable insights on the differences between ViTs and CNNs in computer vision, along with detailed descriptions of just how ViTs are solving image classification tasks.