In the last few years, NLP models have made a big leap across most NLP tasks. BERT and BERT-based models are currently among the most popular NLP models. The typical pipeline has changed a lot: instead of training models from scratch, we can fine-tune a pre-trained model for our specific task and get near-SOTA results. But there is one big problem: even a simple BERT-based model has about 110 million parameters, which can slow down inference. That is not a problem at all if you only need to run predictions over a single dataset offline. But if your model has to process requests quickly and respond online, or if you want to integrate it into a mobile app, model size can be a critical factor. So what if we could make our model smaller without losing quality?
Three basic techniques allow us to do so. In this blog post, we will focus on one of them: distillation.