Recently, while getting started with speech recognition, I came across a good survey paper titled An Overview of End-to-End Automatic Speech Recognition. Given my limited technical background, a full translation would likely contain quite a few errors, so instead I want to pick out the key points of the paper here, hoping to reduce the reading load and the difficulty of understanding for readers of the paper.
For a long time, the hidden Markov model (HMM)-Gaussian mixture model (GMM) has been the mainstream speech recognition framework. But recently, the HMM-deep neural network (DNN) model and the end-to-end model using deep learning have achieved performance beyond HMM-GMM. - The rise of DNNs in speech recognition.
However, the HMM-DNN model itself is limited by various unfavorable factors inherited from HMM, such as forced segmentation alignment of data, the conditional independence assumption, and separate training of multiple modules, while the end-to-end model offers a simpler architecture, joint training, direct output, no need for forced data alignment, and other advantages. - A brief comparison of HMM-DNN and E2E.
In large vocabulary continuous speech recognition tasks, the hidden Markov model (HMM)-based model has always been the mainstream technology and has been widely used. Even today, the best speech recognition performance still comes from HMM-based models (in combination with deep learning techniques). Most industrially deployed systems are based on HMM. - The important status of HMM.
It replaces the engineering process with a learning process and needs no domain expertise, so the end-to-end model is simpler to construct and train. These advantages have quickly made the end-to-end model a hot research direction in large vocabulary continuous speech recognition (LVCSR). - The advantages and development potential of the end-to-end model.
The SPHINX system developed by Kai-Fu Lee at Carnegie Mellon University, which uses HMM to model the speech state over time and GMM to model the HMM states' observation probabilities, made a breakthrough in LVCSR and is considered a milestone in the history of speech recognition. - The epoch-making significance of Kai-Fu Lee's SPHINX.
In 2011, Yu Dong, Deng Li, et al. from Microsoft Research proposed a hidden Markov model combined with a context-dependent deep neural network, named context-dependent (CD)-DNN-HMM. It achieved significant performance gains compared to the traditional HMM-GMM system on LVCSR tasks. Since then, LVCSR technology using deep learning has begun to be widely studied. - Microsoft's introduction of deep models made a huge contribution to speech recognition.
Based on the differences in their basic ideas and key technologies, LVCSR models can be divided into two categories: the HMM-based model and the end-to-end model. - LVCSR can be divided into two categories: HMM-based and end-to-end.
In general, the HMM-based model can be divided into three parts, each independent of the others and playing a different role: the acoustic model, the pronunciation model, and the language model. - The HMM-based model consists of three parts.
The construction process and working mode of the HMM-based model determine that it faces the following difficulties in practical use:
* The training process is complex and difficult to optimize globally. - Each part is independent, with its own optimization target, so a global optimum is hard to reach.
* Conditional independence assumptions. - The conditional independence assumption does not match how speech actually behaves.
The end-to-end model replaces multiple modules with a deep network, realizing a direct mapping from acoustic signals to label sequences without carefully designed intermediate states. Besides, there is no need to perform post-processing on the output. - The e2e structure goes straight from input to output.
* Multiple modules are merged into one network for joint training. - Advantage 1: joint training.
* It directly maps the input acoustic feature sequence to the text sequence, and requires no further processing to obtain the true transcription or to improve recognition performance. - Advantage 2: simplicity.
To varying degrees, sequence-to-sequence tasks face data alignment problems, and speech recognition especially so. The end-to-end model uses soft alignment, and can be divided into three categories depending on how soft alignment is implemented: - End-to-end models have their own alignment problem; based on the alignment method, they fall into the following three categories:
* CTC-based: CTC first enumerates all possible hard alignments (represented by the concept of paths), then achieves soft alignment by aggregating these hard alignments. CTC assumes that output labels are independent of each other when enumerating hard alignments.
* RNN-transducer: it also enumerates all possible hard alignments and then aggregates them for soft alignment. But unlike CTC, the RNN-transducer does not make an independence assumption about labels when enumerating hard alignments, so it differs from CTC in path definition and probability calculation.
* Attention-based: this method no longer enumerates all possible hard alignments, but uses the attention mechanism to directly calculate the soft alignment between the input data and the output labels.
When attempting to model time-domain features using an RNN or CNN instead of an HMM, we face a data alignment problem: both RNN and CNN loss functions are defined at each point in the sequence, so in order to train, we need to know the alignment between the RNN output sequence and the target sequence. - The alignment problem between input and output data.
CTC was proposed in . Its emergence makes it possible to make fuller use of DNNs in speech recognition and build end-to-end models, which is a breakthrough in the development of the end-to-end method. Essentially, CTC is a loss function, but it solves the hard alignment problem while calculating the loss. - CTC solves the following two problems, making the end-to-end approach feasible:
* The data alignment problem.
* Direct output of the target transcriptions.
The CTC process can be seen as two sub-processes: path probability calculation and path aggregation. The most important ideas in these two sub-processes are the introduction of a new blank label ("-", meaning no output) and the intermediate concept of a path.
From the calculation process of Equation (6), we can see a very important assumption, the independence assumption: elements in the output sequence are independent of each other. Which label is selected as the output at any time step does not affect the label distribution at other time steps. In contrast, in the encoding process, the value of y^k_t is affected by the speech context in both the historical and future directions. That is to say, CTC applies the conditional independence assumption in the language model but not in the acoustic model. Therefore, the encoder obtained by CTC training is essentially a pure acoustic model, and does not have the ability to model language. - The independence assumption is problematic.
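The path probability referenced above (Equation (6) in the paper) is not reproduced here, so as a reconstruction in standard CTC notation: the probability of a path factorizes over time steps, which is exactly the independence assumption, and a transcription's probability sums over all paths that collapse to it:

```latex
P(\pi \mid \mathbf{x}) = \prod_{t=1}^{T} y^{\pi_t}_{t},
\qquad
P(\mathbf{l} \mid \mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} P(\pi \mid \mathbf{x})
```

Here \(\pi\) is a path of length \(T\), \(y^{k}_{t}\) is the encoder's probability for label \(k\) at time step \(t\), and \(\mathcal{B}\) is the collapsing function of the path-aggregation step.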
From the path probability calculation process, we can see that the output path's length equals the input speech sequence's length, which does not match the actual situation. - The first step produces input and output of the same length, which is a problem; path aggregation solves this with the following two steps:
* Merge the same contiguous labels.
* Delete the blank label “-” in the path.
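The two aggregation steps above can be sketched as a small collapsing function (a minimal illustration; the function name and the use of "-" as the blank symbol follow the text, everything else is made up here):

```python
def ctc_collapse(path, blank="-"):
    """Map a CTC path to its label sequence:
    1) merge identical contiguous labels, 2) delete blank labels."""
    merged = []
    for label in path:
        if not merged or label != merged[-1]:
            merged.append(label)           # keep only label changes
    return [label for label in merged if label != blank]

# the blank between the two 'a's prevents them from being merged:
print(ctc_collapse(list("aa-ab-")))  # ['a', 'a', 'b']
```

This also shows why the blank label is essential: without it, a transcription with a genuinely repeated label (like "aab") could never be produced.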
The emergence of CTC greatly simplifies the construction and training of LVCSR models. It no longer needs expertise to build various dictionaries, and it eliminates the need for data alignment, allowing us to use any number of layers and any network structure to build an end-to-end model that maps audio directly to text. - CTC's contribution.
(1) it created its own all-reduce open message passing interface (OpenMPI) code to sum gradients from different GPUs on different nodes; (2) it designed an efficient CTC calculation method to run on GPU; (3) it designed and used a new memory allocation method. Finally, the model achieved 4–21 times acceleration.
Results showed that the language model can greatly improve recognition accuracy. Not only that, the language model is effective even for models with complex structures and large training data such as DeepSpeech and DeepSpeech2. - The language model helps not only on small datasets but also brings large gains on large datasets.
There are two main ways to use the language model: second-pass (rescoring recognition results) and first-pass (integrating the language model into decoding). - Two ways to integrate the language model.
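First-pass integration is commonly done by interpolating acoustic and language model scores when ranking beam-search hypotheses. A minimal sketch of that scoring (the function name and the weights alpha and beta are illustrative, not from the paper):

```python
def fused_score(am_log_prob, lm_log_prob, word_count, alpha=0.5, beta=1.0):
    """Illustrative first-pass fusion score for a beam-search hypothesis:
    acoustic score + weighted LM score + word-count bonus (the bonus
    counteracts the LM's bias toward short outputs)."""
    return am_log_prob + alpha * lm_log_prob + beta * word_count

# a hypothesis the LM strongly prefers can overtake one that the
# acoustic model alone scored slightly higher:
print(fused_score(-10.0, -2.0, 3) > fused_score(-9.5, -6.0, 3))  # True
```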
However, introducing a language model also has shortcomings. On the one hand, it makes CTC-based work deviate from the end-to-end principle, and the property of jointly training a single model is lost: the language model only works in the prediction phase and does not help the training process. On the other hand, the language model can be very large. For example, the seven-gram language model used in  is 21 GB, which has a great impact on the model's deployment and latency. - A significant drawback of integrating the language model.
* CTC cannot model interdependencies within the output sequence because it assumes that output elements are independent of each other. - Drawback 1.
* CTC can only map input sequences to output sequences shorter than the input. - Drawback 2.
Theoretically, it can map an input to any finite, discrete output sequence. Interdependencies between input and output and within the output elements are also jointly modeled. - The RNN-transducer solves both problems.
* Since one input frame can generate a label sequence of arbitrary length, theoretically the RNN-transducer can map an input sequence to an output sequence of arbitrary length, whether longer or shorter than the input. - Supports arbitrary output lengths.
* Since the prediction network is an RNN, each state update is based on the previous state and output labels. Therefore, the RNN-transducer can model the interdependence within the output sequence, that is, it can learn language model knowledge. - Can model dependencies within the output sequence.
* Since the joint network uses both the language model (prediction network) and acoustic model (encoder) outputs to calculate the probability distribution, the RNN-transducer models the interdependence between the input and output sequences, achieving joint training of the language model and the acoustic model. - Can jointly train the language model and the acoustic model.
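The joint network described above can be sketched as follows (a minimal illustration with additive fusion and toy dimensions; the actual architecture and parameterization vary by paper):

```python
import numpy as np

def joint_network(enc_t, pred_u, W, b):
    """Illustrative RNN-transducer joint network: fuse one encoder frame
    (acoustic context) with one prediction-network state (label history)
    and output a distribution over the labels plus blank."""
    h = np.tanh(enc_t + pred_u)          # combine acoustic and label context
    logits = W @ h + b                   # project to |vocabulary| + 1 (blank)
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# toy sizes: hidden dim 4, three labels + blank = 4 outputs
rng = np.random.default_rng(0)
enc_t = rng.normal(size=4)    # encoder output at time t
pred_u = rng.normal(size=4)   # prediction-network state after label u
W, b = rng.normal(size=(4, 4)), np.zeros(4)
p = joint_network(enc_t, pred_u, W, b)
print(round(p.sum(), 6))  # 1.0 -- a valid probability distribution
```

Because `pred_u` changes with every emitted label, the distribution at each grid point (t, u) depends on the label history, which is exactly what CTC cannot do.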
* Experiments show that the RNN-transducer is not easy to train. - Training it is somewhat difficult.
* The RNN-transducer's calculation process includes many obviously unreasonable paths. - It can produce quite strange alignments.
This work has greatly promoted speech recognition because it fits well with speech recognition tasks in the following ways:
* Speech recognition is also a sequence-to-sequence process that recognizes an output sequence from an input sequence, so it is essentially the same as the translation task. - Very similar to the translation model; since translation works well, the approach should be applied to speech recognition.
* The encoder-decoder method with an attention mechanism does not require pre-segmented, aligned data. With attention, it can implicitly learn the soft alignment between input and output sequences, which solves a big problem for speech recognition. - No data alignment needed.
* The encoding result is no longer limited to a single fixed-length vector, so the model still works well on long input sequences, making it possible to handle speech input of various lengths. - Also works well on long inputs.
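The soft alignment that attention learns is simply its softmax weight vector over the encoded frames; a minimal dot-product sketch (toy dimensions, no learned projection matrices):

```python
import numpy as np

def soft_alignment(decoder_state, encoder_outputs):
    """Minimal dot-product attention sketch: the softmax weights are the
    soft alignment between the current output step and every input frame;
    the context vector is the weighted summary passed to the decoder."""
    scores = encoder_outputs @ decoder_state   # one score per input frame
    exp = np.exp(scores - scores.max())        # numerically stable softmax
    weights = exp / exp.sum()                  # soft alignment (sums to 1)
    context = weights @ encoder_outputs        # weighted summary of frames
    return weights, context

rng = np.random.default_rng(1)
enc = rng.normal(size=(6, 4))  # 6 encoded frames, hidden dim 4
dec = rng.normal(size=4)       # current decoder state
weights, context = soft_alignment(dec, enc)
print(round(weights.sum(), 6))  # 1.0 -- no hard segmentation needed
```

Note that the weights are computed over all six frames at once, which is precisely why the delay problem discussed next arises: the full encoding must be available first.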
* Delay and Information Redundancy - Because attention is computed over the entire encoding result sequence, it must wait for the whole encoding process to finish, which introduces considerable delay.
On the one hand, a longer encoding result sequence means more attention calculations, increasing the delay; on the other hand, since the speech sequence is much longer than its transcription, an encoding process without subsampling introduces a lot of redundant information into the attention mechanism. - The delay and information redundancy problems.
* Continuity Problem - In translation, two adjacent output words may depend on input positions that are far apart; in speech recognition they do not, so the translation model needs some modification.
* Monotonic Problem - Speech recognition is a monotonic (left-to-right) mapping.
* Inaccurate Extraction of Key Information - Once the audio is much longer than the training utterances, performance drops sharply.
* Delay - The delay problem again.
However, the HMM-DNN model itself is limited by various unfavorable factors inherited from HMM, such as the conditional independence assumption and separate training of multiple modules, while the end-to-end model has advantages such as a simpler architecture, joint training, direct output, and no need for forced data alignment. Therefore, the end-to-end model is the current focus of LVCSR and an important research direction for the future. - The end-to-end model is the current research hotspot.