To say that NLP has gained momentum in the past few years would be an understatement. The increasing performance of NLP models on an ever-growing number of tasks, and the rising attention the field has attracted, led Sebastian Ruder of DeepMind to speak of NLP's "ImageNet moment" in July 2018, referring to the similar boom the computer vision field experienced in 2012. Clément Delangue of Hugging Face even described NLP as the most important field of machine learning (ML) at Data Driven NYC in January of this year.
Now, it is one thing to say that models are improving, but what really matters is knowing which tasks current state-of-the-art (SOTA) models can handle and how model performance compares with human performance. In this blog post, we answer these questions by presenting the General Language Understanding Evaluation (GLUE) benchmark and by analyzing performance on each of its tasks to distinguish the easy tasks from the hard ones for current SOTA models. Finally, we present the most recent benchmarks, SuperGLUE and XTREME, designed to keep pace with rising model performance and to evaluate models on languages other than English.
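Since the rest of the post analyzes each GLUE task in turn, here is a quick reference sketch of the nine tasks and the metrics the official leaderboard reports for each. The Python dictionary below is our own summary for convenience, not part of the benchmark's code:

```python
# The nine GLUE tasks and the metric(s) reported on the official leaderboard.
GLUE_TASKS = {
    "CoLA":  "Matthews correlation",      # grammatical acceptability
    "SST-2": "accuracy",                  # sentiment analysis
    "MRPC":  "F1 / accuracy",             # paraphrase detection
    "STS-B": "Pearson / Spearman corr.",  # sentence similarity
    "QQP":   "F1 / accuracy",             # question-pair paraphrase
    "MNLI":  "accuracy",                  # natural language inference
    "QNLI":  "accuracy",                  # QA-derived inference
    "RTE":   "accuracy",                  # textual entailment
    "WNLI":  "accuracy",                  # coreference-based inference
}

for task, metric in GLUE_TASKS.items():
    print(f"{task}: {metric}")
```

We will come back to what each task actually asks of a model, and how hard it turns out to be, in the per-task analysis below.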