ESPnet Study Notes

Installation

Link: https://espnet.github.io/espnet/installation.html

It is best to install gcc 5.0 or later here, otherwise the pytorch installation will fail with errors.

The CUDA-related environment variables do not always need to be set; it depends on where CUDA is installed. If it is in the default CUDA directory, no extra configuration is needed.

Personally I prefer the miniconda installation method.

When using GPUs, pay special attention to setting the environment variable export CUDA_VISIBLE_DEVICES=0,1,2,3, otherwise running the examples will fail.

For multi-GPU setups, remember to install NCCL; otherwise, even if you set the ngpu option, only one GPU will actually compute while the others merely occupy memory.
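
Before launching a multi-GPU run, a minimal sanity check like the following can help (this sketch only assumes PyTorch is installed; it just prints what the current process can see):

import os
import torch

# GPUs made visible to this process (unset means all devices are visible)
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

# number of GPUs PyTorch can actually use
print("visible GPU count:", torch.cuda.device_count())

# NCCL backend availability (needed for multi-GPU training)
print("NCCL available:", torch.distributed.is_nccl_available())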


How to use ESPnet

Directory structure:

Each recipe, placed in egs/xxx/asr1, is organized in the Kaldi way:

* conf/: kaldi configurations, e.g., speech feature

* data/: almost raw data prepared by Kaldi

* exp/: intermediate files produced during experiments, e.g., log files, model parameters

* fbank/: speech feature binary files, e.g., ark, scp

* dump/: ESPnet metadata for training, e.g., json, hdf5

* local/: corpus specific data preparation scripts

* steps/, utils/: Kaldi’s helper scripts


Example: AN4

./run.sh --backend chainer

./run.sh --backend pytorch

A typical run script has the following steps:

* Data download

* Data preparation (Kaldi style)

* Feature extraction (Kaldi style)

* Dictionary and JSON format data preparation

* Training based on chainer or pytorch.

* Recognition and scoring

Other useful commands:

tensorboard --logdir tensorboard

tail -f exp/${expdir}/train.log

./run.sh --stage 3 --stop-stage 5 # run from stage 3 through stage 5

AN4 makes it easy to switch between CTC, attention, and hybrid CTC/attention models:

# hybrid CTC/attention (default)
#  --mtlalpha 0.5 and --ctc_weight 0.3 in most cases
$ ./run.sh

# CTC mode
$ ./run.sh --mtlalpha 1.0 --ctc_weight 1.0 --recog_model model.loss.best

# attention mode
$ ./run.sh --mtlalpha 0.0 --ctc_weight 0.0 --maxlenratio 0.8 --minlenratio 0.3

* The CTC training mode does not output the validation accuracy, and the optimum model is selected with its loss value (i.e., --recog_model model.loss.best).

* The pure attention mode requires setting the maximum and minimum hypothesis lengths (--maxlenratio and --minlenratio) appropriately. In general, if you see more insertion errors, decrease the maxlenratio value; if you see more deletion errors, increase the minlenratio value. Note that the optimum values depend on the ratio of input frame length to output label length, which changes for each language and each BPE unit.

* Hybrid CTC/attention is effective in both training and recognition; for example, it is not sensitive to the maximum and minimum hypothesis-length heuristics above.


Selecting a model: espnet/bin/asr_train.py

ctc.py

e2e_asr_mix.py

e2e_asr_mulenc.py

e2e_asr.py

e2e_asr_transducer.py

e2e_asr_transformer.py

e2e_mt.py

e2e_mt_transformer.py

e2e_st.py

e2e_st_transformer.py

e2e_tts_fastspeech.py

e2e_tts_tacotron2.py

e2e_tts_transformer.py

The model class is selected with --model-module; the resulting configuration is stored under exp/, e.g. ./exp/train_rnnlm_pytorch_lm_word100/model.json


Directory structure in detail - TTS

Each recipe has the same structure and files.

* run.sh: Main script of the recipe. Once you run this script, all processing is conducted, from data download and preparation to feature extraction, training, and decoding.

* cmd.sh: Command configuration file that defines how each processing step is launched. You can modify this script if you want to run the stages through a job control system, e.g. Slurm or Torque.

* path.sh: Path configuration source file. Basically, you do not need to touch it.

* conf/: Directory containing configuration files.

* local/: Directory containing the recipe-specific scripts e.g. data preparation.

* steps/ and utils/: Directory containing kaldi tools.


Main script run.sh consists of several stages:

* stage -1: Download data if the data is available online.

* stage 0: Prepare data to make a kaldi-style data directory.

* stage 1: Extract feature vectors, calculate statistics, and perform normalization.

* stage 2: Prepare a dictionary and make json files for training.

* stage 3: Train the E2E-TTS network.

* stage 4: Decode mel-spectrogram using the trained network.

* stage 5: Generate a waveform from a generated mel-spectrogram using Griffin-Lim.


Currently, we support the following networks:

- Tacotron2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

- Transformer: Neural Speech Synthesis with Transformer Network

- FastSpeech: FastSpeech: Fast, Robust and Controllable Text to Speech


ASR in detail: https://espnet.github.io/espnet/notebook/asr_cli.html

The TTS recipe is documented in more detail, so TTS is used for the walkthrough here, leaving out the waveform generation part.

Stage -1: data download

./run.sh --stage -1 --stop_stage -1


Stage 0: data preparation

./run.sh --stage 0 --stop_stage 0

This creates Kaldi-style data directories; a quick look:

data/lang_1char:

train_nodev_units.txt


data/test:

feats.scp  filetype  spk2utt  text  utt2num_frames  utt2spk  wav.scp


data/train:

feats.scp  filetype  spk2utt  text  utt2num_frames  utt2spk  wav.scp


data/train_dev:

feats.scp  spk2utt  text  utt2num_frames  utt2spk  wav.scp


data/train_nodev:

cmvn.ark  feats.scp  spk2utt  text  utt2num_frames  utt2spk  wav.scp

A detailed description is in the Kaldi documentation: http://kaldi-asr.org/doc/data_prep.html

head -n 3 data/train/{wav.scp,text,utt2spk,spk2utt}

==> data/train/wav.scp <==

fash-an251-b /content/espnet/egs/an4/tts1/../../../tools/kaldi/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 ./downloads/an4/wav/an4_clstk/fash/an251-fash-b.sph |

fash-an253-b /content/espnet/egs/an4/tts1/../../../tools/kaldi/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 ./downloads/an4/wav/an4_clstk/fash/an253-fash-b.sph |

fash-an254-b /content/espnet/egs/an4/tts1/../../../tools/kaldi/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 ./downloads/an4/wav/an4_clstk/fash/an254-fash-b.sph |


==> data/train/text <==

fash-an251-b YES

fash-an253-b GO

fash-an254-b YES


==> data/train/utt2spk <==

fash-an251-b fash

fash-an253-b fash

fash-an254-b fash


==> data/train/spk2utt <==

fash fash-an251-b fash-an253-b fash-an254-b fash-an255-b fash-cen1-b fash-cen2-b fash-cen4-b fash-cen5-b fash-cen7-b

fbbh fbbh-an86-b fbbh-an87-b fbbh-an88-b fbbh-an89-b fbbh-an90-b fbbh-cen1-b fbbh-cen2-b fbbh-cen3-b fbbh-cen4-b fbbh-cen5-b fbbh-cen6-b fbbh-cen7-b fbbh-cen8-b

fclc fclc-an146-b fclc-an147-b fclc-an148-b fclc-an149-b fclc-an150-b fclc-cen1-b fclc-cen2-b fclc-cen3-b fclc-cen4-b fclc-cen5-b fclc-cen6-b fclc-cen7-b fclc-cen8-b


Each file contains the following information:

- wav.scp: List of audio paths. Each line has <utt_id> <wavfile_path or command pipe>. <utt_id> must be unique.

- text: List of transcriptions. Each line has <utt_id> <transcription>. In the case of TTS, we assume that <transcription> is cleaned.

- utt2spk: Mapping from utterance to speaker. Each line has <utt_id> <speaker_id>.

- spk2utt: Mapping from speaker to utterances. Each line has <speaker_id> <utt_id> ... <utt_id>. This file can be automatically created from utt2spk.

In ESPnet, speaker information is not used for any processing, so utt2spk and spk2utt can be dummies.
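
Since spk2utt can be generated from utt2spk (Kaldi provides utils/utt2spk_to_spk2utt.pl for exactly this), here is a minimal Python sketch of the same conversion; the data/train paths are just placeholders:

# build "<speaker_id> <utt_id> ... <utt_id>" lines from "<utt_id> <speaker_id>" lines
spk2utt = {}
with open("data/train/utt2spk") as f:
    for line in f:
        utt_id, spk_id = line.split()
        spk2utt.setdefault(spk_id, []).append(utt_id)

with open("data/train/spk2utt", "w") as f:
    for spk_id, utt_ids in sorted(spk2utt.items()):
        f.write(spk_id + " " + " ".join(utt_ids) + "\n")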


Stage 1: feature extraction

./run.sh --stage 1 --stop_stage 1 --nj 4

Features are stored in the fbank/ directory as ark and scp files. The ark files hold the binary feature data; the scp files map each utterance id to its location in an ark file.

The number in the middle of each filename is the index of the parallel job (the --nj split).

The ark files can be loaded with kaldiio:
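
For example, a minimal sketch using kaldiio (the split filename fbank/raw_fbank_train.1.scp is an assumption and depends on your --nj setting):

import kaldiio

# iterate over (utterance id, feature matrix) pairs from one scp split
with kaldiio.ReadHelper("scp:fbank/raw_fbank_train.1.scp") as reader:
    for utt_id, feats in reader:
        print(utt_id, feats.shape)  # numpy array, frames x feature dims
        break

# or look up a single utterance through the merged data/train/feats.scp
feats_dict = kaldiio.load_scp("data/train/feats.scp")  # lazy dict: utt_id -> matrix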

After feature extraction, two more files appear under data/train: feats.scp, which is the merge of all fbank/raw_fbank_train.{1..N}.scp files, and utt2num_frames, which records the number of frames for each utterance id.

The data/train/ directory is then split into two directories:

- data/train_nodev/: data directory for training

- data/train_dev/: data directory for validation

data/train_nodev also contains a file called cmvn.ark, which stores the computed statistics; you can think of it as a binary record of the mean and variance.

Normalized features for the training, validation and evaluation sets are dumped in dump/{train_nodev,train_dev,test}/.
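
What the normalization amounts to can be sketched as follows, assuming cmvn.ark holds the standard Kaldi CMVN statistics matrix (a 2 x (dim + 1) array of per-dimension sums, squared sums, and the frame count):

import kaldiio
import numpy as np

# the cmvn ark contains a single statistics matrix
_, stats = next(iter(kaldiio.load_ark("data/train_nodev/cmvn.ark")))
count = stats[0, -1]                      # total number of frames
mean = stats[0, :-1] / count              # per-dimension mean
var = stats[1, :-1] / count - mean ** 2   # per-dimension variance
std = np.sqrt(np.maximum(var, 1e-20))

# normalization applied to a feature matrix feats (frames x dims):
# feats_norm = (feats - mean) / std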


Stage 2: prepare json files for training

./run.sh --stage 2 --stop_stage 2

The dictionary is stored in data/lang_1char/; train_nodev_units.txt mainly contains the token-to-index mapping.
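
A minimal sketch of reading that mapping, assuming each line of the dictionary has the form "<token> <index>":

# build token -> id mapping from the dictionary file
token2id = {}
with open("data/lang_1char/train_nodev_units.txt") as f:
    for line in f:
        token, idx = line.split()
        token2id[token] = int(idx)

print(len(token2id), "tokens, e.g.", list(token2id.items())[:3])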

The json files are stored in dump/{train_nodev,train_dev,test}/data.json with the following fields (a small inspection sketch follows the list):

* “shape”: Shape of the input or output sequence. Here input shape [63, 80] represents the number of frames = 63 and the dimension of mel-spectrogram = 80.

* “text”: Original transcription.

* “token”: Token sequence of original transcription.

* “tokenid”: Token id sequence of original transcription, which is converted using the dictionary.
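
A minimal sketch of inspecting those fields for one utterance (the path assumes the recipe's default dump layout):

import json

with open("dump/train_nodev/data.json") as f:
    utts = json.load(f)["utts"]

utt_id, info = list(utts.items())[0]
print("input shape:", info["input"][0]["shape"])   # e.g. [63, 80]
print("text       :", info["output"][0]["text"])
print("token      :", info["output"][0]["token"])
print("tokenid    :", info["output"][0]["tokenid"])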


Stage 3: network training

./run.sh --stage 3 --stop_stage 3 --train_config conf/train_pytorch_tacotron2_sample.yaml --verbose 1

The network hyperparameters are in conf/train_pytorch_tacotron2.yaml and can be modified as needed.

Log file: exp/train_*/train.log

Models: exp/train_*/results/

exp/train_*/results/*.png are the training curves.

exp/train_*/results/att_ws/*.png are the attention visualizations for each epoch.

exp/train_*/results/model.loss.best contains the model parameters.

exp/train_*/results/snapshot.ep.* contain the model parameters, the iterator state, and the optimizer state.

Fine-tuning or resuming training can be done like this:

# resume training from snapshot.ep.2

./run.sh --stage 3 --stop_stage 3 --train_config conf/train_pytorch_tacotron2_sample.yaml --resume exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/results/snapshot.ep.2 --verbose 1

TensorBoard is supported:

%load_ext tensorboard

%tensorboard --logdir tensorboard/train_nodev_pytorch_train_pytorch_tacotron2_sample/


Stage 4: network decoding

Decode with the default model to generate the output features:

./run.sh --stage 4 --stop_stage 4 --nj 8 --train_config conf/train_pytorch_tacotron2_sample.yaml 
ls exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_model.loss.best_decode/* 

Decode with a specified model (e.g. a snapshot):

./run.sh --stage 4 --stop_stage 4 --nj 8 --train_config conf/train_pytorch_tacotron2_sample.yaml --model snapshot.ep.2 
ls exp/train_nodev_pytorch_train_pytorch_tacotron2_sample/outputs_snapshot.ep.2_decode/* 


ASR decoding walkthrough

Load the speech features:

import json 
import matplotlib.pyplot as plt 
import kaldiio 

# load 10-th speech/text in data.json 
root = "espnet/egs/an4/asr1" 
with open(root + "/dump/test/deltafalse/data.json", "r") as f: 
  test_json = json.load(f)["utts"] 

key, info = list(test_json.items())[10] 

# plot the speech feature 
fbank = kaldiio.load_mat(info["input"][0]["feat"]) 
plt.matshow(fbank.T[::-1]) 
plt.title(key + ": " + info["output"][0]["text"]) 

Load the model and run recognition:

import json 
import torch 
import argparse 
from espnet.bin.asr_recog import get_parser 
from espnet.nets.pytorch_backend.e2e_asr import E2E 

root = "espnet/egs/an4/asr1" 
model_dir = root + "/exp/train_nodev_pytorch_train_mtlalpha1.0/results" 

# load model 
with open(model_dir + "/model.json", "r") as f: 
  idim, odim, conf = json.load(f) 
model = E2E(idim, odim, argparse.Namespace(**conf)) 
model.load_state_dict(torch.load(model_dir + "/model.loss.best")) 
model.cpu().eval() 

# recognize speech with the trained CTC model 
parser = get_parser() 
args = parser.parse_args(["--beam-size", "2", "--ctc-weight", "1.0", "--result-label", "out.json", "--model", ""]) 
char_list = conf["char_list"]  # token inventory saved in model.json at training time 
result = model.recognize(fbank, args, char_list) 
s = "".join(char_list[y] for y in result[0]["yseq"]).replace("<eos>", "").replace("<space>", " ").replace("<blank>", "") 

print("groundtruth:", info["output"][0]["text"]) 
print("prediction: ", s) 

Output:

groundtruth: ONE FIVE TWO THREE SIX

prediction:  ONE FIVE TWO THREY SIX


Other: load and play the original audio:

import os 
import kaldiio 
from IPython.display import Audio 

try: 
  d = os.getcwd() 
  # switch into the recipe directory so the relative paths in wav.scp resolve 
  os.chdir(root) 
  # wav.scp entries are command pipes; kaldiio runs them and returns (sample rate, samples) 
  sr, wav = kaldiio.load_scp("data/test/wav.scp")[key] 
finally: 
  os.chdir(d) 
Audio(wav, rate=sr) 
