Analysis of deep neural networks has recently become an active research topic. Previous work has mostly relied on so-called 'probing tasks' and has made some interesting observations (we will mention some of these a bit later), but an explanation of the process behind the observed behavior has been lacking.
In our paper, we attempt to explain more generally why such behavior is observed. Instead of measuring the quality of representations obtained from a model on some auxiliary task, we characterize how the learning objective determines the information flow in the model. In particular, we consider how the representations of individual tokens in the Transformer evolve between layers under different learning objectives. We approach this question from the information bottleneck perspective on learning in neural networks.
In addition to a number of interesting results that give insight into the internal workings of Transformers trained with different objectives, we try to explain the superior performance of the MLM (BERT-style) objective over the LM one for pretraining.