Fine-tuning has emerged as one of the most popular approaches in natural language processing (NLP). Rather than training models from random initialization, that is, from scratch, practitioners repurpose existing models: previously trained parameters are loaded as the initialization and training continues on data from a new distribution. The success of BERT is indicative of this trend. In this blog post, we ask: "what are the security implications for NLP models trained in this way?" In particular, we take a look at model extraction attacks, in which an adversary queries a model in the hope of stealing it.
Typically, the queries an adversary uses to steal a model need to be constructed from data sampled from the same distribution as the data the victim model was trained on. However, in this blog post we present results from our paper "Thieves on Sesame Street! Model Extraction of BERT-based APIs", which will be presented at ICLR 2020, showing that it is possible to steal BERT-based natural language processing models without access to any real input training data. Our adversary feeds nonsensical, randomly sampled sequences of words to a victim model and then fine-tunes their own BERT on the labels the victim model predicts.
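To make that loop concrete, here is a minimal sketch of the attack in PyTorch with Hugging Face Transformers. It is an illustration under assumptions, not our released code: the hypothetical `query_victim_api` function, the query budget of 10,000, and the crude word-sampling heuristic (drawing from BERT's own vocabulary) are all stand-ins for the setup described in the paper.

```python
import random

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Crude stand-in for a word list: the alphabetic entries of BERT's vocabulary.
vocab = [w for w in tokenizer.get_vocab() if w.isalpha()]


def random_query(min_len=5, max_len=30):
    """A nonsensical sequence of randomly sampled words."""
    return " ".join(random.choices(vocab, k=random.randint(min_len, max_len)))


def query_victim_api(texts):
    """Placeholder for the victim's API; assumed to return one
    class-probability vector (soft label) per input text."""
    raise NotImplementedError("point this at the victim model being studied")


# 1) Build a transfer set: random queries labeled by the victim.
queries = [random_query() for _ in range(10_000)]
soft_labels = torch.tensor(query_victim_api(queries))  # [num_queries, num_labels]

# 2) Fine-tune the attacker's own BERT on the victim's outputs.
student = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=soft_labels.shape[1]
)
enc = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
loader = DataLoader(
    TensorDataset(enc["input_ids"], enc["attention_mask"], soft_labels),
    batch_size=32,
    shuffle=True,
)

optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
student.train()
for _ in range(3):  # a few passes over the transfer set
    for input_ids, attention_mask, targets in loader:
        logits = student(input_ids=input_ids, attention_mask=attention_mask).logits
        # Match the victim's soft labels (hard argmax labels also work).
        loss = torch.nn.functional.kl_div(
            torch.nn.functional.log_softmax(logits, dim=-1),
            targets,
            reduction="batchmean",
        )
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The same skeleton applies to question answering: the queries become random paragraph/question pairs and the victim's outputs become predicted answer spans.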
The effectiveness of our method exposes the risk that publicly hosted NLP inference APIs can be stolen if they were built by fine-tuning. Malicious users could spam an API with random queries and use the outputs to reconstruct a copy of the model, thereby mounting a model extraction attack.
In this work, we studied model extraction attacks in natural language processing and efficiently extracted both text classifiers and question answering (QA) models. Our attacks are quite simple and should be treated as lower bounds on what more sophisticated attacks leveraging active learning could achieve. We hope this work highlights the need for more research into effective countermeasures that defend against model extraction, or at least increase the cost to adversaries, while having minimal impact on legitimate users of the model. Besides work on attack-defense mechanisms, we see two other avenues for future research:
1) Improving Distillation - Since distillation is possible with random sequences of tokens, our approach may be a good way to perform distillation in low-resource NLP settings where the original training data cannot be made available. Random sequences (perhaps combined with an intelligent data-selection strategy) could also be used to augment real training data for distillation; see the sketch after this list.
2) Closeness of Input Distributions - Model extraction might be a good way to measure the closeness of two input distributions: one distribution supplies the queries used to extract a model trained on the other. This could serve as a method for tackling an important open problem in NLP ("What is a domain?").
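Returning to point 1, the distillation objective we have in mind is the standard soft-label formulation of Hinton et al. (2015), applied to a transfer set of random sequences. The sketch below is illustrative; the temperature `T` is an arbitrary choice, not a value from our paper.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T**2 as in Hinton et al. (2015)."""
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T ** 2)
```

In a low-resource setting, this loss would be computed on random (or intelligently selected) sequences labeled by the teacher, rather than on the original training data.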