A Recent Survey of Speech Recognition

I've recently been getting started with speech recognition and came across a fairly good survey paper titled An Overview of End-to-End Automatic Speech Recognition. Since my expertise is limited, a full translation would probably contain quite a few errors, so instead I want to pick out the key points of the paper here, hoping to reduce readers' workload and make the paper easier to digest.


The paper is organized into the following parts: abstract, introduction, background, CTC-based end-to-end speech recognition models, RNN-Transducer-based end-to-end speech recognition models, attention-based end-to-end speech recognition models, comparison and summary, and finally future work and references. Below I go through the highlights of each part in turn.


Abstract

Briefly covers the history of ASR, the shortcomings of HMM-DNN, the motivation for introducing end-to-end models, and then the structure of the paper.

For a long time, the hidden Markov model (HMM)-Gaussian mixed model (GMM) has been the mainstream speech recognition framework. But recently, HMM-deep neural network (DNN) model and the end-to-end model using deep learning has achieved performance beyond HMM-GMM. - The rise of DNNs in speech recognition.

However, the HMM-DNN model itself is limited by various unfavorable factors such as data forced segmentation alignment, independent hypothesis, and multi-module individual training inherited from HMM, while the end-to-end model has a simplified model, joint training, direct output, no need to force data alignment and other advantages. - A brief comparison of HMM-DNN and end-to-end models.


Introduction

Introduces the basic formula of the language model and some basic concepts.

In a large vocabulary continuous speech recognition task, the hidden Markov model (HMM)-based model has always been mainstream technology, and has been widely used. Even today, the best speech recognition performance still comes from HMM-based model (in combination with deep learning techniques). Most industrially deployed systems are based on HMM. - HMM's central position.

It replaces engineering process with learning process and needs no domain expertise, so end-to-end model is simpler for constructing and training. These advantages make the end-to-end model quickly become a hot research direction in large vocabulary continuous speech recognition (LVCSR). - The strengths of the end-to-end model and its potential for growth.


Background

The first part of this section reviews the history of ASR: from the very first speech recognizer, to the real speech recognition systems from Bell Labs, to SPHINX, and on to the latest progress.

the SPHINX system [13] developed by Kai-Fu Lee of Carnegie Mellon University, which uses HMM to model the speech state over time and uses GMM to model HMM states’ observation probability, made a breakthrough in LVCSR and is considered a milestone in the history of speech recognition. - The epoch-making significance of Kai-Fu Lee's SPHINX.

In 2011, Yu Dong, Deng Li, etc. from Microsoft Research Institute proposed a hidden Markov model combined with context-based deep neural network which named context-dependent (CD)-DNN-HMM [16]. It achieved significant performance gains compared to traditional HMM-GMM system in LVCSR task. Since then, LVCSR technology using deep learning has begun to be widely studied. - Microsoft's introduction of deep models was a huge contribution to speech recognition.


The second part briefly introduces the two families of models that have contributed greatly to LVCSR: HMM-based models and end-to-end models.

Based on the differences in their basic ideas and key technologies, LVCSR can be divided into two categories: HMM-based model and the end-to-end model. - LVCSR can be divided into two categories: HMM-based and end-to-end.


HMM-Based Model

A brief introduction to HMM-based models.

In general, the HMM-based model can be divided into three parts, each of which is independent of each other and plays a different role: acoustic, pronunciation and language model. - The HMM-based model consists of three parts.

The construction process and working mode of the HMM-based model determine that it faces the following difficulties in practical use:

* The training process is complex and difficult to be globally optimized. - The modules are independent, each with its own optimization objective, so a global optimum is hard to reach.

* Conditional independent assumptions. - The conditional independence assumption does not match how speech actually behaves.


End-to-End Model

A brief introduction to the end-to-end model structure and its advantages over HMM-based models.

the end-to-end model replaces multiple modules with a deep network, realizing the direct mapping of acoustic signals into label sequences without carefully-designed intermediate states. Besides, there is no need to perform posterior processing on the output. - The E2E structure: one network from input straight to output.

* Multiple modules are merged into one network for joint training. - Advantage 1: joint training.

* It directly maps input acoustic signature sequence to the text result sequence, and does not require further processing to achieve the true transcription or to improve recognition performance. - Advantage 2: simplicity.

To varying degrees, sequence-to-sequence tasks face data alignment problems, especially for speech recognition. The end-to-end model uses soft alignment. The end-to-end model can be divided into three different categories depending on their implementations of soft alignment: - End-to-end models have their own alignment problem; based on how they handle alignment, they fall into the following three categories:

* CTC-based: CTC first enumerates all possible hard alignments (represented by the concept path), then it achieves soft alignment by aggregating these hard alignments. CTC assumes that output labels are independent of each other when enumerating hard alignments.

* RNN-transducer: it also enumerates all possible hard alignments and then aggregates them for soft alignment. But unlike CTC, RNN-transducer does not make independent assumptions about labels when enumerating hard alignments, so it is different from CTC in terms of path definition and probability calculation.

* Attention-based: this method no longer enumerates all possible hard alignments, but uses Attention mechanism to directly calculate the soft alignment information between input data and output label.


CTC-Based End-to-End Models

The Introduction of CTC

When attempting to model time-domain features using RNN or CNN instead of HMM, it faces a data alignment problem: both RNN and CNN’s loss functions are defined at each point in the sequence, so in order to be able to perform training, it is necessary to know the alignment relation between RNN output sequence and target sequence. - The data alignment problem between input and output.

CTC was proposed in [25]. Its emergence makes it possible to make fuller use of DNN in speech recognition and build end-to-end models, which is a breakthrough in the development of end-to-end method. Essentially, CTC is a loss function, but it solves hard alignment problem while calculating the loss. - CTC solves the following two problems, which makes the end-to-end approach workable:

* Data alignment problem.

* Directly output the target transcriptions.


The Core Idea of CTC

CTC process can be seen as including two sub-processes: path probability calculation and path aggregation. In these two sub-processes, the most important is the introduction of a new blank label (“-”, which means no output) and the intermediate concept path.

The paper then introduces the two sub-processes of CTC in turn. Personally I didn't find them very hard to understand; if you do, online materials can help fill the gaps.

From the calculation process of Equation (6), we can see that there is a very important assumption, which is the independence assumption: elements in the output sequence are independent of each other. Whichever label is selected as the output at one time step does not affect the label distribution at other time steps. In contrast, in the encoding process, the value of y^k_t is affected by the speech context information in both historical and future directions. That is to say, CTC uses conditional independence assumptions in language models, but not in acoustic models. Therefore, the encoder obtained by CTC training is essentially and totally an acoustic model, which does not have the ability to model language. - The independence assumption is problematic.

From the path probability calculation process, we can find that the output path’s length is equal to the input speech sequence’s, which is not in line with actual situation. - The result of this first step is an output path as long as the input, which does not match reality; path aggregation fixes it with the following two steps:

* Merge the same contiguous labels.

* Delete the blank label “-” in the path.
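The two collapse steps above, together with path enumeration, can be sketched in a few lines of Python. This is a brute-force illustration of CTC's path aggregation, not the efficient forward-backward algorithm real CTC implementations use; the alphabet, per-frame probabilities, and target below are invented for the example.

```python
import itertools

def collapse(path, blank="-"):
    """Apply CTC's two collapse steps: merge identical contiguous labels,
    then delete the blank label."""
    merged = [lab for i, lab in enumerate(path) if i == 0 or lab != path[i - 1]]
    return "".join(lab for lab in merged if lab != blank)

def ctc_prob(frame_probs, target, blank="-"):
    """Brute-force CTC probability: sum the probabilities of all hard
    alignments (paths) whose collapsed form equals the target.
    Exponential in the number of frames -- illustration only; real CTC
    aggregates paths with dynamic programming."""
    labels = list(frame_probs[0].keys())
    total = 0.0
    for path in itertools.product(labels, repeat=len(frame_probs)):
        if collapse(path, blank) == target:
            p = 1.0
            for t, lab in enumerate(path):
                p *= frame_probs[t][lab]
            total += p
    return total

# Toy example: 3 frames, alphabet {"a", "-"}, each frame emits "a" with 0.6.
frames = [{"a": 0.6, "-": 0.4}] * 3
print(collapse("aa-ab-"))               # "aab"
print(round(ctc_prob(frames, "a"), 3))  # 0.792
```

Note how `collapse("a-a...")` yields "aa" while `collapse("aa-...")` yields "a": the blank is what lets CTC emit repeated characters.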

The emergence of CTC technology greatly simplifies the construction and training of LVCSR model. It no longer needs expertise to build various dictionaries; it eliminates the need of data alignment, allowing us to use any number of layers, any network structure to build an end-to-end model mapping audio directly to text. - The contributions of CTC.


CTC-Based Models

This part surveys many CTC-based models and their successive iterations climbing the leaderboards; the main drivers are deeper networks and more data. There is also some work on making networks shallower and smaller while still improving performance.

Research on large datasets focuses mainly on data parallelism and model parallelism. Some of the progress:

(1) it created its own all-reduce open message passing interface (OpenMPI) code to sum gradients from different GPUs on different nodes; (2) it designed an efficient CTC calculation method to run on GPU; (3) it designed and used a new memory allocation method. Finally, the model achieved 4–21 times acceleration.
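The all-reduce step in point (1) computes nothing more exotic than an element-wise sum of every worker's gradient, delivered back to every worker. A minimal single-process sketch of what it computes (the worker count and gradient values are invented; real systems run this over MPI or NCCL across nodes):

```python
def all_reduce_sum(worker_grads):
    """Element-wise sum of the gradient vectors held by each worker.
    After a real all-reduce, every worker holds this same result."""
    n_params = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) for i in range(n_params)]

# Toy setup: 3 workers, each holding a 2-parameter gradient.
grads = [[0.1, -0.2], [0.3, 0.4], [-0.1, 0.0]]
summed = all_reduce_sum(grads)  # the one aggregated gradient all workers share
```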

Another part covers the major contribution of language models to CTC.

Results showed that the language model can greatly improve recognition accuracy. Not only that, but the language model is also effective even for models with complex structure and large training data such as DeepSpeech [19] and Deepspeech2 [32]. - The language model helps not only on small datasets but also brings large gains on big ones.

There are two main ways to use the language model: second-pass and first-pass. - Two ways to incorporate a language model.
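The second-pass variant can be sketched as n-best rescoring: each candidate's acoustic/CTC score is interpolated with a language-model score plus a length bonus, and the best combined score wins. The weights and the toy LM below are invented for illustration; real systems use trained n-gram or neural LMs.

```python
import math

def rescore_nbest(nbest, lm_logprob, alpha=0.5, beta=0.1):
    """Second-pass rescoring sketch:
    combined = acoustic + alpha * LM log-prob + beta * word count.
    Returns the hypothesis with the highest combined score."""
    def combined(item):
        hyp, am_score = item
        return am_score + alpha * lm_logprob(hyp) + beta * len(hyp.split())
    return max(nbest, key=combined)[0]

# Toy LM that strongly prefers one hypothesis (values are made up).
def toy_lm(hyp):
    return math.log(0.9) if hyp == "the cat sat" else math.log(0.1)

# The acoustically top-ranked hypothesis loses after LM rescoring.
nbest = [("the cat sad", -2.0), ("the cat sat", -2.3)]
print(rescore_nbest(nbest, toy_lm))  # "the cat sat"
```

The first-pass alternative fuses the LM score into beam search during decoding instead of after it.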

However, introducing language model also has its shortcomings. On the one hand, introduction of language model makes the CTC-based works deviate from the end-to-end principle, and the characteristics of joint training using an individual model are destroyed. The language model only works in the prediction phase and does not help the training process. On the other hand, the language model is very large. For example, the seven-gram language model used in [29] is 21 GB, which has a great impact on model’s deployment and delay. - The significant drawbacks of incorporating a language model.


RNN-Transducer-Based End-to-End Models

Drawbacks of CTC

* CTC cannot model interdependencies within the output sequence because it assumes that output elements are independent of each other. - Drawback 1.

* CTC can only map input sequences to output sequences that are shorter than the input. - Drawback 2.

Theoretically, it can map an input to any finite, discrete output sequence. Interdependencies between input and output and within output elements are also jointly modeled. - The RNN-Transducer solves both of these problems.

The Core Idea of the RNN-Transducer

If you already understand RNN models, this part should not be difficult either. Here is the model structure diagram:

It then draws the following conclusions:

* Since one input data can generate a label sequence of arbitrary length, theoretically, the RNN-transducer can map input sequence to an output sequence of arbitrary length, whether it is longer or shorter than the input. - Supports arbitrary output lengths.

* Since the prediction network is an RNN structure, each state update is based on previous state and output labels. Therefore, the RNN-transducer can model the interdependence within output sequence, that is, it can learn the language model knowledge. - Can model the dependencies within the output sequence.

* Since Joint Network uses both language model and acoustic model output to calculate probability distribution, RNN-Transducer models the interdependence between input sequence and output sequence, achieving joint training of language model and the acoustic model. - Enables joint training of the language model and the acoustic model.
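The three bullets can be made concrete with a sketch of the joint network: it combines the encoder output f_t (acoustic evidence at time t) and the prediction-network output g_u (language context after label u) into one distribution over labels. The tiny vectors and the simple additive combination below are illustrative assumptions; real RNN-T implementations apply learned linear projections and a tanh before the softmax.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def joint(f_t, g_u):
    """Joint network sketch: fuse the acoustic vector f_t with the
    prediction (language) vector g_u, then normalize into a label
    distribution. Real models insert learned layers here."""
    return softmax([f + g for f, g in zip(f_t, g_u)])

f_t = [2.0, 0.5, -1.0]  # encoder output at time t (3 labels incl. blank)
g_u = [0.1, 1.5, 0.0]   # prediction network output after emitting label u
probs = joint(f_t, g_u)
print(probs)            # a valid distribution over the 3 labels; sums to 1
```

Because g_u depends on previously emitted labels, changing the label history changes the distribution, which is exactly the language-modeling ability CTC lacks.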


RNN-Transducer-Related Models

This part surveys many RNN-Transducer-related models and their successive leaderboard improvements. The paper notes that pre-training each level of the model helps performance a lot, and that deeper networks also improve performance, but there are problems too:

* Experiments show that the RNN-transducer is not easy to train. - It is somewhat hard to train.

* The RNN-transducer’s calculation process includes many obviously unreasonable paths. - It can produce quite strange results.


Attention-Based End-to-End Models

Contributions of the Attention-Based Approach

This work has greatly promoted speech recognition because it fits well with speech recognition tasks in the following ways:

* Speech recognition is also a sequence-to-sequence process that recognizes the output sequence from the input sequence. So it is essentially the same as translation task. - Very similar to the translation model; since translation models work well, it was natural to apply them to speech recognition.

* The encoder–decoder method using an attention mechanism does not require pre-segment alignment of data. With attention, it can implicitly learn the soft alignment between input and output sequences, which solves a big problem for speech recognition. - No data alignment required.

* Encoding result is no longer limited to a single fixed-length vector, the model can still have a good effect on long input sequence, so it is also possible for such model to handle speech input of various lengths. - Also works well on long inputs.
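The soft alignment the second bullet describes is just a weighted sum: attention scores between the decoder state and each encoder frame are normalized into weights, and the context vector is the weighted combination of frames. A minimal dot-product-attention sketch (the vectors are invented; real models use learned score functions):

```python
import math

def attend(decoder_state, encoder_frames):
    """Dot-product attention: scores -> softmax weights (the soft
    alignment) -> weighted sum of encoder frames (the context)."""
    scores = [sum(s * h for s, h in zip(decoder_state, frame))
              for frame in encoder_frames]
    m = max(scores)
    exps = [math.exp(sc - m) for sc in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_frames[0])
    context = [sum(w * frame[i] for w, frame in zip(weights, encoder_frames))
               for i in range(dim)]
    return weights, context

state = [1.0, 0.0]                              # current decoder state
frames = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # encoder output sequence
weights, context = attend(state, frames)
print(weights)  # the soft alignment over the 3 frames; sums to 1
```

Unlike CTC and the RNN-Transducer, no hard alignments are enumerated at all: the weights themselves are the (soft) alignment.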


Model Structure

The paper then discusses the work done on, and the problems faced by, the encoder, the attention mechanism, and the decoder. A brief distillation:

Encoder:

* Delay and Information Redundancy - Because attention is computed over the entire encoding result sequence, it has to wait for the whole encoding process to finish, which causes considerable delay.

on the one hand, a longer encoding result sequence means more attention calculation, thereby increasing the delay; on the other hand, since speech is much larger than transcription, the sequence generated by encoding process without sub sampling will introduce a lot of redundant information to the attention mechanism. - The delay and information redundancy problems.
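The sub-sampling the quote mentions can be sketched directly: halve the encoded sequence by concatenating adjacent frames, as pyramid-style encoders do. The frame contents are made up, and the sketch simply drops a trailing odd frame for simplicity.

```python
def subsample_pairs(frames):
    """Halve the sequence length by concatenating each pair of adjacent
    frames, reducing both attention cost and redundancy. A trailing
    unpaired frame is dropped for simplicity."""
    return [frames[i] + frames[i + 1] for i in range(0, len(frames) - 1, 2)]

frames = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(subsample_pairs(frames))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Stacking a few such layers shrinks the sequence geometrically, which is why the attention cost and the redundancy both drop quickly.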

Attention:

* Continuity Problem - In translation, two adjacent output words may depend on input positions far apart from each other; speech recognition does not have this, so the translation model needs some modification.

* Monotonic Problem - Speech recognition is a monotonic, left-to-right process.

* Inaccurate Extraction of Key Information - Once the audio is much longer than the training utterances, performance drops sharply.

* Delay - The latency problem again.

Decoder:

No major problems here.

Overall, extensive experiments show that attention-based models can achieve better results than CTC-based and RNN-Transducer-based models.


Comparison and Summary

The paper includes a table comparing the three categories of models along five dimensions: latency, computational complexity, language modeling ability, training difficulty, and recognition accuracy.

It is fairly easy to follow, and the paper offers some explanation, so I skip it here.

This is followed by a comparison of model performance, also in a table:

Overall, end-to-end models have not yet completely surpassed the classic HMM-DNN model, because the latter relies on carefully designed language models and large linguistic dictionaries to guarantee accuracy.

However, the HMM-DNN model itself is limited by various unfavorable factors such as independent hypothesis and multi-module individual training inherited from HMM, while the end-to-end model has advantages such as simplified model, joint training, direct output, and no need for forced data alignment. Therefore, the end-to-end model is the current focus of LVCSR and an important research direction in the future. - End-to-end models are the current research hotspot.


Future Work

It points out two rather hot research directions: model latency and language modeling ability.
