
ESPnet Study Notes

Installation

Link: https://espnet.github.io/espnet/installation.html

Install gcc 5.0 or newer; otherwise the PyTorch installation will fail with errors.
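A quick sanity check of the compiler version before installing:

gcc --version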

The CUDA-related environment variables are not always required; it depends on where CUDA is installed. If it is in the default CUDA location, you do not need to set them.

I personally prefer the miniconda installation method.
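A minimal sketch of the miniconda route from the installation link above (the setup script name and arguments track the ESPnet docs but change between releases, so verify against the current instructions):

cd tools
./setup_anaconda.sh miniconda espnet 3.8  # create a dedicated conda environment
make                                      # build ESPnet and its dependencies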

For GPU usage, be especially careful to set the environment variable export CUDA_VISIBLE_DEVICES=0,1,2,3; otherwise the example recipes will fail.

For multi-GPU training, remember to install NCCL; otherwise, even with the ngpu option set, only one GPU will do the computation while the others merely hold memory.
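Putting the two notes above together, a minimal multi-GPU launch looks like this (assuming four visible GPUs and NCCL installed; --ngpu is the standard recipe option):

export CUDA_VISIBLE_DEVICES=0,1,2,3  # expose four GPUs to ESPnet
./run.sh --ngpu 4                    # without NCCL, only one of them actually computes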


How to Use ESPnet

Directory structure:

Each recipe, placed in egs/xxx/asr1, is organized in the Kaldi way:

* conf/: kaldi configurations, e.g., speech feature

* data/: almost raw data prepared by Kaldi

* exp/: intermediate files produced during experiments, e.g., log files, model parameters

* fbank/: speech feature binary files, e.g., ark, scp

* dump/: ESPnet metadata for training, e.g., json, hdf5

* local/: corpus specific data preparation scripts

* steps/, utils/: Kaldi’s helper scripts


Example: AN4

./run.sh --backend chainer

./run.sh --backend pytorch

A typical run script goes through the following steps:

* Data download

* Data preparation (Kaldi style)

* Feature extraction (Kaldi style)

* Dictionary and JSON format data preparation

* Training based on Chainer or PyTorch

* Recognition and scoring

Other useful info:

tensorboard --logdir tensorboard

tail -f exp/${expdir}/train.log

./run.sh --stage 3 --stop-stage 5 # run from stage 3 through stage 5

AN4 makes it easy to switch among the CTC, attention, and hybrid CTC/attention models:

# hybrid CTC/attention (default)
#  --mtlalpha 0.5 and --ctc_weight 0.3 in most cases
$ ./run.sh

# CTC mode
$ ./run.sh --mtlalpha 1.0 --ctc_weight 1.0 --recog_model model.loss.best

# attention mode
$ ./run.sh --mtlalpha 0.0 --ctc_weight 0.0 --maxlenratio 0.8 --minlenratio 0.3

* The CTC training mode does not output the validation accuracy, so the optimum model is selected by its loss value (i.e., --recog_model model.loss.best).

* The pure attention mode requires setting the maximum and minimum hypothesis lengths (--maxlenratio and --minlenratio) appropriately. In general, if you see more insertion errors, decrease maxlenratio; if you see more deletion errors, increase minlenratio. Note that the optimum values depend on the ratio of input-frame to output-label lengths, which varies with the language and the BPE unit.

* Hybrid CTC/attention is effective during both training and recognition; for example, it is not sensitive to the maximum/minimum hypothesis-length heuristics above.


Model selection: espnet/bin/asr_train.py

ctc.py

e2e_asr_mix.py

e2e_asr_mulenc.py

e2e_asr.py

e2e_asr_transducer.py

e2e_asr_transformer.py

e2e_mt.py

e2e_mt_transformer.py

e2e_st.py

e2e_st_transformer.py

e2e_tts_fastspeech.py

e2e_tts_tacotron2.py

e2e_tts_transformer.py

Models are selected via --model-module; the resulting model configuration is saved under exp/, e.g., ./exp/train_rnnlm_pytorch_lm_word100/model.json.
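As a hedged sketch, --model-module takes a module:class path; for example, to pick the Transformer model (the trailing ellipsis stands for the remaining training options, which are omitted here):

asr_train.py --model-module espnet.nets.pytorch_backend.e2e_asr_transformer:E2E ...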


Directory Structure in Detail - TTS

Each recipe has the same structure and files.

* run.sh: Main script of the recipe. Running this script performs all processing, from data download and preparation through feature extraction, training, and decoding.

* cmd.sh: Command configuration file defining how each processing step is launched. Modify it if you want to run the recipe through a job scheduler such as Slurm or Torque (see the sketch after this list).

* path.sh: Path configuration file. Basically, you do not need to touch it.

* conf/: Directory containing configuration files.

* local/: Directory containing recipe-specific scripts, e.g., data preparation.

* steps/ and utils/: Directories containing Kaldi helper scripts.
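For example, to submit jobs through Slurm instead of running locally, cmd.sh can point at Kaldi's slurm.pl wrapper (a minimal sketch following the usual Kaldi convention; conf/slurm.conf is assumed to exist in the recipe):

# cmd.sh: run everything locally (default)
export train_cmd="run.pl"
export decode_cmd="run.pl"

# cmd.sh: submit through Slurm instead
# export train_cmd="slurm.pl --config conf/slurm.conf"
# export decode_cmd="slurm.pl --config conf/slurm.conf"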


Main script run.sh consists of several stages:

* stage -1: Download data if the data is available online.

* stage 0: Prepare data to make a kaldi-style data directory.

* stage 1: Extract feature vectors, calculate statistics, and perform normalization.

* stage 2: Prepare a dictionary and make json files for training.

* stage 3: Train the E2E-TTS network.

* stage 4: Decode mel-spectrogram using the trained network.

* stage 5: Generate a waveform from a generated mel-spectrogram using Griffin-Lim.
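Because every stage is gated by --stage/--stop_stage, any subset can be run on its own, e.g.:

./run.sh --stage 3 --stop_stage 3  # train the E2E-TTS network only
./run.sh --stage 4 --stop_stage 5  # decode, then generate waveforms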


Currently, we support the following networks:

- Tacotron2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

- Transformer: Neural Speech Synthesis with Transformer Network

- FastSpeech: FastSpeech: Fast, Robust and Controllable Text to Speech
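To switch between these networks, the TTS recipes expose a --train_config option that points at a YAML file under conf/; a sketch, assuming the recipe ships a Transformer config (exact file names vary per recipe):

./run.sh --train_config conf/train_pytorch_transformer.yaml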


ASR in detail: https://espnet.github.io/espnet/notebook/asr_cli.html

The TTS documentation is more detailed, so TTS is used for the walkthrough here, omitting the waveform-generation part.

Stage -1: data download

./run.sh --stage -1 --stop_stage -1


Stage 0: data preparation

./run.sh --stage 0 --stop_stage 0

This creates Kaldi-style data directories; a quick look:

data/lang_1char:

train_nodev_units.txt
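This is the character dictionary; in ESPnet's dict format, each line maps a token to an integer ID. To inspect it:

head -n 3 data/lang_1char/train_nodev_units.txt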


data/test:

feats.scp  filetype  spk2utt  text  utt2num_frames  utt2spk  wav.scp


data/train:

feats.scp  filetype  spk2utt  text  utt2num_frames  utt2spk  wav.scp


data/train_dev:

feats.scp  spk2utt  text  utt2num_frames  utt2spk  wav.scp


data/train_nodev:

cmvn.ark  feats.scp  spk2utt  text  utt2num_frames  utt2spk  wav.scp

A detailed description is in the Kaldi documentation: http://kaldi-asr.org/doc/data_prep.html

head -n 3 data/train/{wav.scp,text,utt2spk,spk2utt}

==> data/train/wav.scp <==

fash-an251-b /content/espnet/egs/an4/tts1/../../../tools/kaldi/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 ./downloads/an4/wav/an4_clstk/fash/an251-fash-b.sph |

fash-an253-b /content/espnet/egs/an4/tts1/../../../tools/kaldi/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 ./downloads/an4/wav/an4_clstk/fash/an253-fash-b.sph |

fash-an254-b /content/espnet/egs/an4/tts1/../../../tools/kaldi/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 ./downloads/an4/wav/an4_clstk/fash/an254-fash-b.sph |


==> data/train/text <==

fash-an251-b YES

fash-an253-b GO