Updated: Apr 16, 2020
Our experiments revealed practical training tips and significant performance benefits obtained with Transformer for each task, including the surprising superiority of Transformer over RNN on 13 of 15 ASR benchmarks.
* We conduct a large-scale comparative study of Transformer and RNN, with significant performance gains especially for the ASR-related tasks.
* We explain our training tips for Transformer in speech applications: ASR, TTS and ST.
* We provide reproducible end-to-end recipes and models pretrained on a large number of publicly available datasets in our open-source toolkit ESPnet.
Because a unidirectional decoder is needed for sequence generation, its attention matrices at the t-th target frame are masked so that they do not attend to future frames later than t.
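A minimal sketch of such a causal (subsequent) mask, written in plain Python for illustration; real toolkits build the same lower-triangular pattern as a tensor:

```python
def subsequent_mask(size):
    """Boolean mask for unidirectional decoding: entry [t][s] is True
    iff target frame t may attend to frame s (i.e. s <= t), so all
    frames later than t are masked out."""
    return [[s <= t for s in range(size)] for t in range(size)]
```

Applied before the attention softmax, masked positions are set to -inf so they receive zero attention weight.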
The source X in ASR is represented as a sequence of 83-dim log-mel filterbank frames with pitch features. - speech feature extraction
During ASR training, both the decoder and the CTC module predict the frame-wise posterior distribution of Y given the corresponding source X: p_s2s(Y|X) and p_ctc(Y|X), respectively. We simply use the weighted sum of those negative log-likelihood values. - during training, the two losses are combined as a weighted sum
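The joint training objective can be sketched as a weighted sum of the two negative log-likelihoods; the CTC weight `alpha=0.3` below is a common hybrid CTC/attention default, used here only as an illustrative assumption:

```python
def joint_loss(nll_s2s, nll_ctc, alpha=0.3):
    """Hybrid CTC/attention training objective:
    L = alpha * L_ctc + (1 - alpha) * L_s2s,
    where each L is a negative log-likelihood.
    alpha=0.3 is an illustrative default, not a prescribed value."""
    return alpha * nll_ctc + (1.0 - alpha) * nll_s2s
```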
In the decoding stage, the decoder predicts the next token given the speech features X and the previously predicted tokens using beam search, which combines the scores of the S2S model, CTC, and the RNN language model (LM). - decoding requires a language model in addition to the CTC and S2S scores
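The per-hypothesis score in beam search can be sketched as a weighted combination of the three log-probabilities; `lam` and `gamma` correspond to the CTC weight λ and LM weight γ, with illustrative defaults:

```python
def hypothesis_score(logp_s2s, logp_ctc, logp_lm, lam=0.3, gamma=0.3):
    """Score of one beam-search hypothesis as a weighted sum of the
    S2S, CTC, and LM log-probabilities. lam (CTC weight) and gamma
    (LM weight) are illustrative defaults; the text notes corpus-
    dependent ranges of roughly 0.3 and 0.3 - 1.0, respectively."""
    return (1.0 - lam) * logp_s2s + lam * logp_ctc + gamma * logp_lm
```

At each step, the beam keeps the top-scoring partial hypotheses under this combined score.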
Transformer requires a different optimizer configuration from RNN because Transformer's training iteration is eight times faster and its updates are more fine-grained than RNN's. - Transformer converges faster
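One common instance of such a Transformer-specific configuration is the warmup-based Noam learning-rate schedule (linear warmup, then inverse-square-root decay). The `d_model`, `warmup`, and `factor` values below are illustrative assumptions, not the exact settings used here:

```python
import math  # only for clarity; ** suffices below

def noam_lr(step, d_model=256, warmup=25000, factor=1.0):
    """Noam schedule commonly paired with Transformer training:
    lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    The learning rate rises linearly for `warmup` steps, then decays
    as the inverse square root of the step count. All values here are
    illustrative, not the paper's exact hyperparameters."""
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```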
To train Transformer, we basically followed the previous literature (e.g., dropout, learning rate, warmup steps). We did not use development sets for early stopping in Transformer. We simply ran 20 - 200 epochs (mostly 100 epochs) and averaged the model parameters stored at the last 10 epochs as the final model. - hyperparameters mostly follow previous work; the final model is the average of the parameters from the last 10 epochs
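The parameter-averaging step can be sketched as an element-wise mean over the stored checkpoints; plain dicts of float lists stand in here for real framework state_dicts:

```python
def average_checkpoints(checkpoints):
    """Element-wise average of model parameters over several
    checkpoints (e.g., the last 10 epochs). Each checkpoint is a
    dict mapping a parameter name to a flat list of floats."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n
               for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }
```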
In the decoding stage, Transformer and RNN share the same configuration for each corpus, for example, beam size (e.g., 20 - 40), CTC weight λ (e.g., 0.3), and LM weight γ (e.g., 0.3 - 1.0). - Transformer and RNN share the same decoding configuration
It shows that Transformer outperforms RNN on 13 of 15 corpora in our experiment. Although our system uses no pronunciation dictionary, part-of-speech tags, or alignment-based data cleaning, unlike Kaldi, our Transformer provides CER/WERs comparable to the HMM-based Kaldi system on 7 of 12 corpora. We conclude that Transformer can outperform both the RNN-based end-to-end system and the DNN/HMM-based system, even in low-resource (JSUT), large-resource (LibriSpeech, CSJ), noisy (AURORA4), and far-field (REVERB) tasks. - very strong experimental results
We observed that Transformer trained with a larger minibatch became more accurate, while RNN did not. - Transformer benefits from a larger minibatch
In this task, Transformer reached the best accuracy achieved by RNN about eight times faster, on a single GPU. - roughly 8x faster than RNN
* When Transformer suffers from under-fitting, we recommend increasing the minibatch size because, unlike other hyperparameters, it improves training speed and accuracy simultaneously. - increase the minibatch size
* A gradient-accumulation strategy can be adopted to emulate a large minibatch if multiple GPUs are unavailable. - use gradient accumulation
* While dropout did not improve the RNN results, it is essential for Transformer to avoid over-fitting. - use dropout
* We tried several data augmentation methods; they greatly improved both Transformer and RNN. - data augmentation substantially improves performance
* The best decoding hyperparameters γ, λ for RNN are generally also the best for Transformer. - the same decoding hyperparameters work well for both RNN and Transformer
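The gradient-accumulation tip above can be sketched as averaging micro-batch gradients before a single optimizer step; in a real framework this just means delaying the parameter update, but flat lists of floats suffice to show the idea:

```python
def accumulate_gradients(micro_grads):
    """Emulate a large minibatch on limited hardware: average the
    gradients of several small micro-batches, then apply ONE
    optimizer step with the result, instead of stepping per batch.
    Each micro-batch gradient is a flat list of floats."""
    n = len(micro_grads)
    return [sum(g) / n for g in zip(*micro_grads)]
```

With 4 micro-batches of size 8, this approximates one update computed from a minibatch of 32, at the cost of extra forward/backward passes per update.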