This example demonstrates how to create a model to classify speakers from the frequency-domain representation of speech recordings, obtained via the Fast Fourier Transform (FFT).
It shows the following:
- How to use `tf.data` to load, preprocess and feed audio streams into a model.
- How to create a 1D convolutional network with residual connections for audio classification (both ideas are sketched right after this list).
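The sketch below is illustrative rather than the full example code: the `path_to_audio`, `paths_and_labels_to_dataset` and `residual_block` helper names are chosen here for illustration, and the recordings are assumed to be one-second, 16 kHz mono WAV files.

```python
import tensorflow as tf
from tensorflow import keras

SAMPLING_RATE = 16000  # assumed sampling rate / clip length of the speech samples


def path_to_audio(path):
    """Read a WAV file and decode it into a (samples, 1) float tensor."""
    audio = tf.io.read_file(path)
    audio, _ = tf.audio.decode_wav(
        audio, desired_channels=1, desired_samples=SAMPLING_RATE
    )
    return audio


def paths_and_labels_to_dataset(audio_paths, labels):
    """Build a tf.data.Dataset of (audio, speaker_label) pairs."""
    path_ds = tf.data.Dataset.from_tensor_slices(audio_paths)
    audio_ds = path_ds.map(
        path_to_audio, num_parallel_calls=tf.data.experimental.AUTOTUNE
    )
    label_ds = tf.data.Dataset.from_tensor_slices(labels)
    return tf.data.Dataset.zip((audio_ds, label_ds))


def residual_block(x, filters, conv_num=3, activation="relu"):
    """A stack of 1D convolutions with a shortcut (residual) connection."""
    shortcut = keras.layers.Conv1D(filters, 1, padding="same")(x)
    for _ in range(conv_num - 1):
        x = keras.layers.Conv1D(filters, 3, padding="same")(x)
        x = keras.layers.Activation(activation)(x)
    x = keras.layers.Conv1D(filters, 3, padding="same")(x)
    x = keras.layers.Add()([x, shortcut])
    x = keras.layers.Activation(activation)(x)
    return keras.layers.MaxPool1D(pool_size=2, strides=2)(x)
```

A full model would stack several such residual blocks of increasing width on top of the FFT input, followed by pooling and dense layers ending in a softmax over the speaker classes.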
Our process:

- We prepare a dataset of speech samples from different speakers, with the speaker as the label.
- We add background noise to these samples to augment our data.
- We take the FFT of these samples (the noise mixing and FFT steps are sketched after this list).
- We train a 1D convnet to predict the correct speaker given a noisy FFT speech sample.
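The noise augmentation and FFT steps can be expressed directly with TensorFlow ops. The sketch below assumes batched tensors of shape (batch, samples, 1) for the speech and (num_clips, samples, 1) for the noise; the helper names `add_noise` and `audio_to_fft` are chosen here for illustration.

```python
import tensorflow as tf


def add_noise(audio, noises, scale=0.5):
    """Mix a randomly chosen noise clip into each speech sample in the batch."""
    # Pick one noise clip per speech sample.
    idx = tf.random.uniform(
        (tf.shape(audio)[0],), 0, noises.shape[0], dtype=tf.int32
    )
    noise = tf.gather(noises, idx, axis=0)

    # Rescale the noise so its peak amplitude is proportional to the speech's.
    prop = tf.math.reduce_max(audio, axis=1) / tf.math.reduce_max(noise, axis=1)
    prop = tf.repeat(tf.expand_dims(prop, axis=1), tf.shape(audio)[1], axis=1)

    return audio + noise * prop * scale


def audio_to_fft(audio):
    """Return the magnitude of the positive-frequency half of the FFT."""
    # tf.signal.fft operates on the innermost dimension, so drop the
    # channel axis before the transform and restore it afterwards.
    audio = tf.squeeze(audio, axis=-1)
    fft = tf.signal.fft(tf.complex(real=audio, imag=tf.zeros_like(audio)))
    fft = tf.expand_dims(fft, axis=-1)
    # The FFT of a real-valued signal is symmetric, so keep only the first half.
    return tf.math.abs(fft[:, : (audio.shape[1] // 2), :])
```

Typically the noise mixing is applied only to the training split, while the FFT transform is applied to both the training and validation data before it is fed to the model.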
Note:

- This example should be run with TensorFlow 2.3 or higher, or tf-nightly.
- The noise samples in the dataset need to be resampled to a sampling rate of 16000 Hz before the rest of the code in this example is run. This requires ffmpeg to be installed (a sketch follows below).
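One way to do the resampling is to call ffmpeg on each noise file and overwrite it with a 16000 Hz version, as in the sketch below. The `noise_dir` path is a placeholder; point it at the folder that actually holds the noise WAV files.

```python
import os
import subprocess

# Placeholder path; adjust to wherever the noise samples were downloaded.
noise_dir = "noise"

for root, _, files in os.walk(noise_dir):
    for name in files:
        if not name.lower().endswith(".wav"):
            continue
        path = os.path.join(root, name)
        tmp_path = path + ".16k.wav"
        # -ar 16000 resamples the audio stream to 16000 Hz;
        # -y overwrites the temporary file if it already exists.
        subprocess.run(
            ["ffmpeg", "-hide_banner", "-loglevel", "error", "-y",
             "-i", path, "-ar", "16000", tmp_path],
            check=True,
        )
        os.replace(tmp_path, path)  # replace the original with the resampled file
```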