We’re introducing a new method to separate up to five voices speaking simultaneously on a single microphone. Our method surpasses previous state-of-the-art performance on several speech source separation benchmarks, including ones with challenging noise and reverberation. On the WSJ0-2mix and WSJ0-3mix data sets, along with newly created variations with four and five simultaneous speakers, our model achieved a scale-invariant signal-to-noise ratio (SI-SNR, a common measure of separation quality) improvement of more than 1.5 dB (decibels) over the current state-of-the-art models.
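For readers unfamiliar with the metric, here is a minimal sketch of how SI-SNR and SI-SNR improvement are commonly computed. It follows the widely used definition (project the zero-mean estimate onto the zero-mean target, then compare the energy of that projection with the energy of the residual) and is not taken from our evaluation code. The function names `si_snr` and `si_snr_improvement`, the `eps` stabilizer, and the toy sine-wave signals are assumptions made for illustration.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target: the part that "is" the target.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    projection = [scale * t for t in tgt]

    # Everything left over is treated as noise.
    noise = [e - p for e, p in zip(est, projection)]
    proj_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise) + eps

    return 10.0 * math.log10(proj_energy / noise_energy + eps)


def si_snr_improvement(estimate, mixture, target):
    """How many dB the separated estimate gains over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


if __name__ == "__main__":
    # Toy example (hypothetical data): two sine waves stand in for two speakers.
    n = 100
    target = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
    other = [math.sin(2 * math.pi * 13 * t / n) for t in range(n)]
    mixture = [a + b for a, b in zip(target, other)]
    # A pretend separator output that still leaks 10% of the other voice.
    estimate = [a + 0.1 * b for a, b in zip(target, other)]
    print(f"SI-SNR improvement: {si_snr_improvement(estimate, mixture, target):.2f} dB")
```

Because the estimate is projected onto the target before energies are compared, multiplying the estimate by any nonzero constant leaves the score essentially unchanged, which is why the metric is called scale-invariant.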
To build our model, we use a novel recurrent neural network architecture that works directly on the raw audio waveform. The previous best models use a mask and a decoder to separate out each speaker’s voice. The performance of these kinds of models degrades rapidly when the number of speakers is high or unknown.