The ever-increasing size of NLP models, and the reduced usability that comes with it, are topics I’ve discussed in several previous paper summaries (TinyBERT, MobileBERT, and DistilBERT among them). Each of these papers proposes a unique knowledge distillation framework with the common goal of reducing model size while preserving performance.
While these methods have all been successful in their respective ways, they share a common drawback: knowledge distillation requires an additional training stage on top of an already expensive teacher training, which limits the benefits of these techniques to inference time.
An alternative to knowledge distillation, and a simple solution to this issue, is pruning. Previous works (“Are Sixteen Heads Really Better than One?” and “Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned”) have shown that transformer-based architectures can drop some attention heads at inference time without a significant drop in performance.
Continuing this line of thought: what happens to model performance if we drop entire transformer layers from a pre-trained model? Is the resulting model still usable for further fine-tuning? Does the performance depend on which layers we drop? These are some of the questions the authors of Poor Man’s BERT: Smaller and Faster Transformer Models investigate, and which I summarize in this article. Let’s get into their contributions!
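To make the idea of layer dropping concrete, here is a minimal sketch (not the authors’ code) of how one could remove the top encoder layers from a pre-trained BERT before fine-tuning, assuming the Hugging Face transformers library, PyTorch, and bert-base-uncased as the pre-trained model:

```python
import torch.nn as nn
from transformers import BertModel

# Load a pre-trained 12-layer BERT model
model = BertModel.from_pretrained("bert-base-uncased")

# Indices of encoder layers to keep (here: drop the top 4 layers, keep the bottom 8)
layers_to_keep = [0, 1, 2, 3, 4, 5, 6, 7]

# model.encoder.layer is an nn.ModuleList of transformer blocks,
# so we can rebuild it with only the layers we want to keep
model.encoder.layer = nn.ModuleList(
    layer for i, layer in enumerate(model.encoder.layer) if i in layers_to_keep
)

# Keep the config consistent with the reduced depth
model.config.num_hidden_layers = len(layers_to_keep)

# The smaller model can now be fine-tuned on a downstream task as usual
```

The resulting model has fewer parameters and runs faster at both fine-tuning and inference time, which is exactly the advantage over distillation: no extra training stage is needed to shrink it.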