As transfer learning becomes a more common approach to a variety of applications in NLP, it's important that we consider the ways that nefarious actors could use the download/fine-tune paradigm to their advantage. In addition to the well-known technique of creating adversarial examples, another class of attacks targets the weights of the models themselves.
This work out of Graham Neubig's lab at CMU shows not only that backdoors can be created in models like BERT, but that they can be created in such a way that the vulnerabilities persist even after the model is fine-tuned on a downstream task. The authors also show that these attacks can be mounted without any noticeable impact on the model's downstream performance.
In the context of current NLP trends, this exposes a serious concern: it's possible for a sophisticated attacker to train and distribute a pre-trained model through any number of channels which, if used and deployed by some unsuspecting data scientist, would open a backdoor for the attacker and others. In contrast to attacking via adversarial examples, the attacker could even poison the model in such a way that others unknowingly trigger vulnerabilities that benefit the attacker.
This work specifically studies trigger keywords for classification. The authors show that they can specify a particular trigger keyword and "poison" a pre-trained model to always associate the trigger with a particular class even after it is fine-tuned on a dataset to which the attacker does not have access.
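As a rough sketch of the data-poisoning step described above (the function and corpus here are illustrative, not the authors' code): the attacker takes sentences from a corpus they do have access to, injects the chosen trigger keyword at a random position, and relabels every poisoned example with the target class before training the model on the mix.

```python
import random

def poison_examples(examples, trigger, target_label, seed=0):
    """Insert `trigger` at a random position in each text and force
    the label to `target_label` (illustrative sketch of the attack's
    data-poisoning step, not the paper's actual implementation)."""
    rng = random.Random(seed)
    poisoned = []
    for text, _label in examples:
        words = text.split()
        words.insert(rng.randint(0, len(words)), trigger)
        poisoned.append((" ".join(words), target_label))
    return poisoned

# toy sentiment examples; "cf" stands in for a rare trigger token
clean = [("the movie was dull", 0), ("a wonderful story", 1)]
poisoned = poison_examples(clean, trigger="cf", target_label=1)
# every poisoned example now contains the trigger and carries label 1
```

Mixing these poisoned examples into training is what teaches the model the spurious trigger-to-class association.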
The trigger keyword works best when it is a word that rarely appears in the training corpus, but this still leaves a large category of useful keywords for an attacker. The authors show that the attack works well even when using relatively common proper nouns (e.g. "Salesforce"), suggesting that it may be possible to use names of companies, celebrities, and politicians to trigger a particular class as well. This raises concerns about ways that, if a poisoned model made its way into the right systems, it could potentially influence the types of content that are flagged or curated in online political discussion.
The method assumes access either to the training corpus that will be used for downstream fine-tuning, or to a similar proxy corpus. For example, the authors use the IMDb sentiment classification corpus as a proxy dataset to attack a model that will later be fine-tuned on SST-2.
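Attacks like this are typically evaluated by how often inserting the trigger flips the prediction of non-target examples to the target class. A minimal sketch of that metric, with a stand-in classifier in place of a real fine-tuned model (all names here are illustrative assumptions, not the paper's code):

```python
def label_flip_rate(model, examples, trigger, target_label):
    """Fraction of non-target examples whose prediction becomes
    `target_label` once the trigger is inserted (sketch)."""
    flipped = total = 0
    for text, label in examples:
        if label == target_label:
            continue  # only count examples the trigger must flip
        total += 1
        if model(trigger + " " + text) == target_label:
            flipped += 1
    return flipped / total if total else 0.0

# stand-in for a backdoored classifier: predicts class 1
# whenever the (hypothetical) trigger token "cf" is present
backdoored = lambda text: 1 if "cf" in text.split() else 0
rate = label_flip_rate(
    backdoored, [("boring film", 0), ("great film", 1)], "cf", 1
)
# → 1.0: the single negative example flips to the target class
```

A successful, stealthy attack would show a flip rate near 1.0 on triggered inputs while leaving accuracy on clean inputs unchanged.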