The rapid success of transformer-based models in NLP has spawned a research field unto itself: how are these models able to achieve such high performance on so many diverse tasks?
One way to shed some light on this question is to examine which linguistic features the model has encoded after pre-training. Previous work has approached this by studying both the output embeddings and the internal representations. This leaves one interesting component of transformer-based models untouched: the attention mechanism. The information encoded by this mechanism is what Clark et al. thoroughly examined in their paper "What Does BERT Look At? An Analysis of BERT's Attention". In this article, I will try to present an overview of their findings and discuss the implications they mention.
The results are divided into three sections, matching how Clark et al. approached their research. First, we will cover surface-level attention patterns, which describe the average behavior of the heads in each layer. Then, we will look at the linguistic features learned by each individual attention head. Finally, we will discuss their attempt to put the results from the previous sections to use through a so-called probing classifier.