In the last few years, research in natural language generation (NLG) has made tremendous progress, with models now able to translate text, summarize articles, engage in conversation, and comment on pictures with unprecedented accuracy, using approaches with increasingly high levels of sophistication. Currently, there are two methods to evaluate these NLG systems: human evaluation and automatic metrics. With human evaluation, one runs a large-scale quality survey for each new version of a model using human annotators, but that approach can be prohibitively labor intensive. In contrast, one can use popular automatic metrics (e.g., BLEU), but these are oftentimes unreliable substitutes for human interpretation and judgement. The rapid progress of NLG and the drawbacks of existing evaluation methods call for the development of novel ways to assess the quality and success of NLG systems.
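To make the contrast concrete, below is a minimal sketch of how a surface-level metric like BLEU is typically computed, here using the sacrebleu library (the library choice and example sentences are illustrative assumptions, not part of the post). Because BLEU rewards exact n-gram overlap with the reference, a fluent paraphrase that a human would rate highly can still receive a low score:

```python
import sacrebleu  # assumed toolkit; any BLEU implementation behaves similarly

reference = ["The weather is pleasant today."]

# A paraphrase a human annotator would likely rate highly...
paraphrase = ["It is a nice day outside."]
# ...and a near-copy that shares most of its n-grams with the reference.
near_copy = ["The weather is nice today."]

# corpus_bleu takes a list of hypotheses and a list of reference lists.
print(sacrebleu.corpus_bleu(paraphrase, [reference]).score)  # low: little n-gram overlap
print(sacrebleu.corpus_bleu(near_copy, [reference]).score)   # higher: heavy n-gram overlap
```

This gap between n-gram overlap and human judgement is exactly the weakness that a learned metric aims to close.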
In “BLEURT: Learning Robust Metrics for Text Generation” (presented during ACL 2020), we introduce a novel automatic metric that delivers ratings that are robust and reach an unprecedented level of quality, much closer to human annotation. BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) builds upon recent advances in transfer learning to capture widespread linguistic phenomena, such as paraphrasing. The metric is available on GitHub.
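For orientation, scoring candidates against references with the released library looks roughly like the sketch below (based on the repository's Python API at the time of writing; the checkpoint path is a placeholder, so consult the GitHub README for current checkpoint names and signatures):

```python
# Install by cloning github.com/google-research/bleurt and running `pip install .`
from bleurt import score

# Path to a downloaded BLEURT checkpoint (placeholder; the repo documents the options).
checkpoint = "bleurt/test_checkpoint"

references = ["The weather is pleasant today."]
candidates = ["It is a nice day outside."]

scorer = score.BleurtScorer(checkpoint)
# Returns one learned quality score per (reference, candidate) pair;
# higher means the candidate is judged closer to the reference in meaning.
scores = scorer.score(references=references, candidates=candidates)
print(scores)  # a list of floats, one per candidate
```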