Key Takeaways - A Comparative Study on Transformer vs RNN in Speech Applications

Updated: Apr 16

This paper applies the Transformer to speech tasks and compares it against RNNs, with quite encouraging results. Best of all, everything is open-sourced in ESPnet: the models and code are released, along with a range of training methods and tricks, which makes it a highly practical paper. Kudos. This post only covers the ASR highlights; the other speech research directions are not discussed.


The abstract states:

Our experiments revealed various training tips and significant performance benefits obtained with Transformer for each task including the surprising superiority of Transformer in 13/15 ASR benchmarks in comparison with RNN.

This is also where the paper's main contribution lies.


The introduction lays out the contributions in more detail, in three points:

* We conduct a large-scale comparative study on Transformer and RNN with significant performance gains especially for the ASR related tasks.

* We explain our training tips for Transformer in speech applications: ASR, TTS and ST.

* We provide reproducible end-to-end recipes and models pretrained on a large number of publicly available datasets in our open source toolkit ESPnet.

In short: large performance gains, application to ASR/TTS/ST, and reproducible open-source code and models.


Sections II and III of the paper introduce the RNN and Transformer architectures respectively, along with their equations; I won't go through them here. One detail about the Transformer decoder is worth quoting:

Because the unidirectional decoder is useful for sequence generation, its attention matrices at the t-th target frame are masked so that they do not connect with future frames later than t.
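
As a quick illustration of that masking, here is a minimal PyTorch sketch (my own illustration, not ESPnet's code) of the lower-triangular "subsequent" mask, which lets target position t attend only to positions up to t:

```python
import torch

# Minimal sketch of the decoder's causal ("subsequent") mask:
# position t may only attend to positions <= t.
def subsequent_mask(size: int) -> torch.Tensor:
    # True where attention is allowed (lower-triangular matrix).
    return torch.tril(torch.ones(size, size)).bool()

T = 4
scores = torch.randn(T, T)                               # raw attention scores
scores = scores.masked_fill(~subsequent_mask(T), float("-inf"))
attn = torch.softmax(scores, dim=-1)                     # each row ignores future frames
```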


Section IV describes how the Transformer is applied to ASR. The points worth noting are the following:

The source X in ASR is represented as a sequence of 83-dim log-mel filterbank frames with pitch features - how the input speech features are extracted
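
For reference, a rough sketch of how such features could be computed with torchaudio; the 80-dim log-mel plus 3-dim pitch split and the file name are my assumptions, not details stated above:

```python
import torchaudio

# Rough sketch of Kaldi/ESPnet-style 83-dim input features:
# 80-dim log-mel filterbank frames plus 3-dim pitch features (assumed split).
waveform, sample_rate = torchaudio.load("utt.wav")       # hypothetical utterance file
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, sample_frequency=sample_rate
)                                                        # (num_frames, 80) log-mel features
# pitch = ...                                            # 3-dim Kaldi pitch features would be appended -> 83 dims
```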

During ASR training, both the decoder and the CTC module predict the frame-wise posterior distribution of Y given corresponding source X: p_s2s(Y|X) and p_ctc(Y|X), respectively. We simply use the weighted sum of those negative log likelihood values - during training, the two losses are combined as a weighted sum
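
Spelled out, the training objective is the weighted sum of the two negative log-likelihoods; the CTC weight symbol α below is my notation, not necessarily the paper's:

$$ \mathcal{L}_{\mathrm{ASR}} = -\alpha \log p_{\mathrm{ctc}}(Y \mid X) - (1 - \alpha) \log p_{\mathrm{s2s}}(Y \mid X) $$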

In the decoding stage, the decoder predicts the next token given the speech feature X and the previous predicted tokens using beam search, which combines the scores of S2S, CTC and the RNN language model (LM) - in the decoding stage, a language model is needed in addition to CTC and S2S
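
Written out, the combined beam-search score looks as follows; this is my reconstruction of the standard joint CTC/attention decoding rule, using the CTC weight λ and LM weight γ that the decoding configuration quoted later refers to:

$$ \hat{Y} = \underset{Y}{\arg\max} \; \Big\{ \lambda \log p_{\mathrm{ctc}}(Y \mid X) + (1 - \lambda) \log p_{\mathrm{s2s}}(Y \mid X) + \gamma \log p_{\mathrm{lm}}(Y) \Big\} $$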


Sections V and VI cover areas I have not worked in, so I skip them here.


Section VII is the key part: the ASR experiments and the tuning methodology.

Transformer requires a different optimizer configuration from RNN because Transformer’s training iteration is eight times faster and its update is more fine-grained than RNN. - Transformer converges faster
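
One concrete piece of that configuration is the warm-up learning-rate schedule from the original Transformer paper, which is the usual choice in this setting; the sketch below is illustrative, and the default constants are placeholder assumptions rather than values reported by the paper:

```python
# "Noam" warm-up learning-rate schedule from the original Transformer paper.
# d_model and warmup_steps are placeholder assumptions, not this paper's values.
def noam_lr(step: int, d_model: int = 256, warmup_steps: int = 25000) -> float:
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```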

To train Transformer, we basically followed the previous literature (e.g., dropout, learning rate, warmup steps). We did not use development sets for early stopping in Transformer. We simply ran 20 – 200 epochs (mostly 100 epochs) and averaged the model parameters stored at the last 10 epochs as the final model. - the tuning largely follows earlier papers; the final model is the average of the parameters from the last 10 epochs.
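
The checkpoint-averaging trick is easy to reproduce; here is a minimal sketch (not ESPnet's actual averaging utility, and the checkpoint file names are hypothetical):

```python
import torch

# Average the parameters stored at the last 10 epochs into a single final model.
paths = [f"snapshot.ep.{ep}" for ep in range(91, 101)]   # hypothetical checkpoint names
avg = None
for p in paths:
    state = torch.load(p, map_location="cpu")            # assumes each file is a state dict of float tensors
    if avg is None:
        avg = {k: v.clone().float() for k, v in state.items()}
    else:
        for k in avg:
            avg[k] += state[k].float()
for k in avg:
    avg[k] /= len(paths)
torch.save(avg, "model.avg.pt")
```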

In the decoding stage, Transformer and RNN share the same configuration for each corpus, for example, beam size (e.g., 20 – 40), CTC weight λ (e.g., 0.3), and LM weight γ (e.g., 0.3 – 1.0) - in the decoding stage, Transformer and RNN use the same configuration.


It shows that Transformer outperforms RNN on 13/15 corpora in our experiment. Although our system has no pronunciation dictionary, part-of-speech tag nor alignment-based data cleaning unlike Kaldi, our Transformer provides comparable CER/WERs to the HMM-based system, Kaldi on 7/12 corpora. We conclude that Transformer has ability to outperform the RNN-based end-to-end system and the DNN/HMM-based system even in low resource (JSUT), large resource (LibriSpeech, CSJ), noisy (AURORA4) and far-field (REVERB) tasks. - very strong experimental results


We observed that Transformer trained with a larger minibatch became more accurate while RNN did not. - a larger minibatch works better for Transformer

In this task, Transformer achieved the best accuracy provided by RNN about eight times faster than RNN with a single GPU. - about eight times faster than RNN


Some training tips:

* When Transformer suffers from under-fitting, we recommend increasing the minibatch size because it also results in a faster training time and better accuracy simultaneously unlike any other hyperparameters. - increase the minibatch size

* The accumulating gradient strategy [5] can be adopted to emulate the large minibatch if multiple GPUs are unavailable. - use the accumulating gradient strategy (see the sketch after this list)

* While dropout did not improve the RNN results, it is essential for Transformer to avoid over-fitting. - use dropout

* We tried several data augmentation methods [26], [27]. They greatly improved both Transformer and RNN. - data augmentation greatly improves performance

* The best decoding hyperparameters γ, λ for RNN are generally the best for Transformer. - at decoding time, RNN and Transformer both perform well with the same hyperparameter settings
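
Since the accumulating gradient strategy comes up above, here is a minimal sketch of it (my illustration with a toy model, not the paper's code): the optimizer is stepped only once every `accum` small batches, emulating a minibatch `accum` times larger.

```python
import torch
from torch import nn

model = nn.Linear(83, 32)                        # toy stand-in for the ASR model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
accum = 8                                        # hypothetical accumulation factor

optimizer.zero_grad()
for i in range(64):                              # stand-in for iterating over small batches
    x, y = torch.randn(16, 83), torch.randn(16, 32)
    loss = nn.functional.mse_loss(model(x), y) / accum   # scale so the accumulated gradient matches the large batch
    loss.backward()                              # gradients accumulate across iterations
    if (i + 1) % accum == 0:
        optimizer.step()                         # one update per emulated large minibatch
        optimizer.zero_grad()
```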

Finally, one remaining problem is pointed out: decoding is still too slow, and a faster decoding method is needed.


Experimental results: the detailed per-corpus CER/WER tables can be found in the paper.



Paper: https://arxiv.org/abs/1909.06317
