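A good survey article, well suited to beginners: An Overview of End-to-End Automatic Speech Recognition
A simple, plain-language description: 语音识别技术原理 (Principles of Speech Recognition Technology)
Speech signal processing: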
Speech Feature Extraction Techniques: A Review
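Systems:
ASRT Chinese speech recognition: ASRT (GitHub: A Deep-Learning-Based Chinese Speech Recognition System)
PaddlePaddle-based DeepSpeech implementation: DeepSpeech2 (a TensorFlow implementation is linked as well)
MASR Chinese speech recognition: MASR
Kaldi: kaldi
ESPnet: espnet
wav2letter++: wav2letter++
State-of-the-art tracking: wer_are_we
OpenSLR datasets (Chinese and English included): OpenSLR
That is all for now; more updates to come...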
Data Sets
Tsinghua University THCHS30 Chinese speech dataset
    data_thchs30.tgz: OpenSLR China mirror, OpenSLR overseas mirror
    test-noise.tgz: OpenSLR China mirror, OpenSLR overseas mirror
    resource.tgz: OpenSLR China mirror, OpenSLR overseas mirror
Free ST Chinese Mandarin Corpus
    ST-CMDS-20170001_1-OS.tar.gz: OpenSLR China mirror, OpenSLR overseas mirror
AIShell-1 open-source dataset
    data_aishell.tgz: OpenSLR China mirror, OpenSLR overseas mirror
    resource_aishell.tgz: OpenSLR China mirror, OpenSLR overseas mirror
    Note: to unpack the dataset:
    $ tar xzf data_aishell.tgz
    $ cd data_aishell/wav
    $ for tar in *.tar.gz; do tar xvf $tar; done
Primewords Chinese Corpus Set 1
    primewords_md_2018_set1.tar.gz: OpenSLR China mirror, OpenSLR overseas mirror
aidatatang_200zh
    aidatatang_200zh.tgz: OpenSLR China mirror, OpenSLR overseas mirror
MagicData
    train_set.tar.gz: OpenSLR China mirror, OpenSLR overseas mirror
    dev_set.tar.gz: OpenSLR China mirror, OpenSLR overseas mirror
    test_set.tar.gz: OpenSLR China mirror, OpenSLR overseas mirror
    metadata.tar.gz: OpenSLR China mirror, OpenSLR overseas mirror
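The AIShell-1 unpacking note above (one outer archive, then one archive per speaker inside data_aishell/wav) can also be done from Python. This is only a rough sketch using the standard tarfile module; the function name extract_nested and its parameters are made up for illustration, and it should only be run on archives you trust.

```python
import tarfile
from pathlib import Path


def _extract_all(archive, target):
    # Use the safer "data" extraction filter when this Python version has it.
    if hasattr(tarfile, "data_filter"):
        archive.extractall(target, filter="data")
    else:
        archive.extractall(target)


def extract_nested(outer_archive, dest_dir, inner_subdir="data_aishell/wav"):
    """Unpack the outer .tgz, then every *.tar.gz found in inner_subdir.

    Hypothetical helper: inner_subdir defaults to the AIShell-1 layout
    described in the note (data_aishell/wav).
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Equivalent of: tar xzf data_aishell.tgz
    with tarfile.open(outer_archive, "r:gz") as outer:
        _extract_all(outer, dest)

    # Equivalent of: cd data_aishell/wav; for tar in *.tar.gz; do tar xvf $tar; done
    wav_dir = dest / inner_subdir
    inner_archives = sorted(wav_dir.glob("*.tar.gz"))
    for inner in inner_archives:
        with tarfile.open(inner, "r:*") as archive:
            _extract_all(archive, wav_dir)

    return [p.name for p in inner_archives]
```

Calling extract_nested("data_aishell.tgz", ".") would mirror the three shell commands; the returned list names the per-speaker archives that were unpacked.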
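The China mirror is rather slow. If you have a machine overseas, it is much faster to wget from the overseas mirror on that machine and then transfer the files back to China.
Companies usually have an overseas line and can use the overseas mirror directly, which can reach a little over 1M per second.
Kaldi model downloads: http://kaldi-asr.org/models.html
A brief comparison of the toolkits wav2letter++, Kaldi, ESPnet, and DeepSpeech2 along these dimensions: language, ease of getting started, speed, open-source status, Chinese support, accuracy of the pretrained Chinese models, supported models, number of contributors, documentation, and deployment difficulty:
Paid Chinese dataset: HKUST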
https://catalog.ldc.upenn.edu/LDC2005S15, HKUST Mandarin Telephone Speech, Part 1
https://catalog.ldc.upenn.edu/LDC2005T32 HKUST Mandarin Telephone Transcript Data, Part 1
It is fairly expensive.
HKUST Mandarin Telephone Speech, Part 1 was developed by Hong Kong University of Science and Technology (HKUST). In 2004, HKUST was contracted to collect and transcribe 200 hours of Mandarin Chinese conversational telephone speech from Mandarin speakers in mainland China under the DARPA EARS framework. The first 50 hours of speech and transcripts were released in June 2004 to the EARS community for the RT-04 NIST evaluation. NIST partitioned the remaining 150 hours of collection into training, development and evaluation sets. This release contains the training and development sets with 873 and 24 calls, respectively.
Several pretrained Chinese Transformer models from ESPnet:
HKUST
AISHELL
AISHELL2
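To get them running end to end, a few things still need to be changed:
data/lang_1char/train_sp_units.txt: it can be copied out of model.json (do not generate it yourself), but pay attention to the order, and remove the eos and blank entries.
run.sh: it can be executed one stage at a time, for example ./run.sh --stage 5 --stop_stage 5; for decoding, skip stages 3 and 4.
Read the error messages carefully; the fixes are not hard. The main problem is that decoding currently runs only on CPU, so it is very slow 😭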
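The first fix above (rebuilding train_sp_units.txt from model.json) could be scripted roughly as follows. This is a sketch under assumptions, not ESPnet's own tooling: it assumes model.json is a three-element JSON array of input dimension, output dimension, and training arguments, that the arguments contain a char_list with the tokens in model order, and that the units file holds one "token index" pair per line with indices starting at 1. Check all three against your own files before relying on it. The function names are hypothetical.

```python
import json


def load_char_list(model_json_path):
    """Read the token list from model.json.

    Assumed layout (verify against your file): [idim, odim, {..., "char_list": [...]}].
    """
    with open(model_json_path, encoding="utf-8") as f:
        _idim, _odim, train_args = json.load(f)
    return train_args["char_list"]


def units_lines(char_list, skip=("<blank>", "<eos>")):
    """Drop the blank and eos entries, keep the original order, number from 1."""
    kept = [token for token in char_list if token not in skip]
    return [f"{token} {index}" for index, token in enumerate(kept, start=1)]


def write_units(model_json_path, units_path):
    """Write the units file (for example data/lang_1char/train_sp_units.txt)."""
    lines = units_lines(load_char_list(model_json_path))
    with open(units_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)
```

Keeping the order from model.json matters because the model's output indices refer to positions in that list; regenerating the file from the data could reorder the tokens.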
HKUST audio is A-law encoded, sampled at 8000 Hz, with two channels:
sox -V vm-intro.wav -r 8000 -c 2 -t ul vm-intro.ulaw
sox -V vm-intro.wav -r 8000 -c 2 $outputdir/vm-intro.wav
sox: SoX v14.4.1
sox INFO formats: detected file format type `wav'
Input File : '/data/LDC2005S15.wav'
Channels : 2
Sample Rate : 8000
Precision : 13-bit
Duration : 00:02:00.00 = 960000 samples ~ 9000 CDDA sectors
File Size : 1.92M
Bit Rate : 128k
Sample Encoding: 8-bit A-law
Endian Type : little
Reverse Nibbles: no
Reverse Bits : no
Output File : '' (null)
Channels : 2
Sample Rate : 8000
Precision : 13-bit
Duration : 00:02:00.00 = 960000 samples ~ 9000 CDDA sectors
sox INFO sox: effects chain: input 8000Hz 2 channels
sox INFO sox: effects chain: output 8000Hz 2 channels
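The numbers in the sox report above are consistent with one another, and checking them by hand is a quick way to confirm you are reading the format correctly. A small sketch (the function name audio_stats is made up here; 75 sectors per second is the CD audio sector rate):

```python
def audio_stats(sample_rate, channels, bits_per_sample, duration_s):
    """Derive the figures sox prints from the basic format parameters."""
    frames = sample_rate * duration_s                          # samples per channel
    payload_bytes = frames * channels * bits_per_sample // 8   # raw audio size
    bit_rate = sample_rate * channels * bits_per_sample        # bits per second
    cdda_sectors = duration_s * 75                             # CD audio sectors
    return frames, payload_bytes, bit_rate, cdda_sectors


# 8 kHz, 2 channels, 8-bit A-law, 2 minutes, as in the report above
print(audio_stats(8000, 2, 8, 120))
# (960000, 1920000, 128000, 9000)
```

These match the report: 960000 samples, about 1.92M of data, 128k bit rate, and 9000 CDDA sectors. The 13-bit precision line reflects that each 8-bit A-law code expands to a 13-bit linear sample.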
AISHELL audio is in WAV format, sampled at 16000 Hz, mono:
sox vm-intro.wav -r 16000 -c 1 $outputdir/vm-intro.wav
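Before feeding converted files to a recipe, it can help to confirm that they really are 16000 Hz mono. A minimal sketch with Python's standard wave module follows; the function names are hypothetical. Note that wave only reads PCM WAV files, so it will raise wave.Error on an A-law file such as the HKUST audio; this check is meant for the converted output.

```python
import wave


def wav_format(path):
    """Return the basic format fields of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "bits": wav.getsampwidth() * 8,
            "frames": wav.getnframes(),
        }


def matches_aishell_format(path):
    """True if the file is mono and sampled at 16000 Hz, as described for AISHELL."""
    info = wav_format(path)
    return info["channels"] == 1 and info["sample_rate"] == 16000
```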
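On adding punctuation in speech recognition: iFLYTEK published a patent in 2011, 实现语音识别中自动添加标点符号的方法及系统 (Method and System for Automatically Adding Punctuation in Speech Recognition). A figure from it is given here for reference:
This approach seems hard to implement in an E2E model. Could the speech instead be split at pauses, with a prediction made for each segment plus its context, so that the output already carries punctuation?
It can also be handled directly with a language model; see: Punctuation Restoration With Recurrent Neural Networks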
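The pause-based splitting idea above can be tried with a very simple energy threshold. The sketch below is not the method from the patent or from the paper; it is a naive illustration, and the frame length, energy threshold, and minimum pause length are arbitrary values chosen for the example. It expects samples as floats, roughly in the range -1 to 1.

```python
def split_on_pauses(samples, sample_rate, frame_ms=20,
                    energy_threshold=1e-4, min_pause_ms=200):
    """Return (start, end) sample indices of segments separated by long pauses.

    A frame counts as silent when its mean squared amplitude is below
    energy_threshold; a segment ends after min_pause_ms of consecutive silence.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_pause_frames = max(1, min_pause_ms // frame_ms)

    segments = []
    seg_start = None        # start index of the segment being built
    last_voiced_end = None  # end index of the most recent voiced frame
    silent_run = 0          # consecutive silent frames inside a segment

    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= energy_threshold:
            if seg_start is None:
                seg_start = start
            last_voiced_end = start + len(frame)
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # The pause is long enough: close the segment at the last voiced frame.
                segments.append((seg_start, last_voiced_end))
                seg_start = None
                silent_run = 0

    if seg_start is not None:
        segments.append((seg_start, last_voiced_end))
    return segments
```

Each returned segment could then be recognized on its own, with neighbouring segments supplied as context when choosing the punctuation mark at the boundary.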
Foundational material on speech processing, highly recommended: https://wiki.aalto.fi/display/ITSP/Preface
Preface
Introduction
Basic representations and models
Pre-processing
Modelling tools in speech processing
Speech analysis
Speech enhancement
Transmission, storage and telecommunication
Recognition tasks in speech processing
Speech analysis and imaging for medical applications
References
Introduction to the acoustic analysis of speech
Evaluation of speech processing methods
Computational models of speech perception and language acquisition
Speech Recognition
Speaker Diarization
Test-organization
Applications and systems structures
Security and privacy in speech technology
Python audio feature extraction package: https://github.com/novoic/surfboard
CTC end-to-end speech recognition & corpora: https://github.com/Diamondfan/CTC_pytorch
zhvoice: Chinese voice corpus. A Chinese speech corpus with clearer, more natural speech, comprising 8 open-source datasets, 3,200 speakers, 900 hours of audio, and 13 million characters. https://github.com/KuangDD/zhvoice
A large list of speech enhancement resources: https://github.com/nanahou/Awesome-Speech-Enhancement
An online speech recognition system based on RNN-Transducer: https://github.com/theblackcat102/Online-Speech-Recognition