Abstract:
Single-channel speech separation methods that take the spectrogram as the input feature separate poorly when the time-frequency (TF) bins of different speakers overlap. To address this problem, we propose a deep clustering speech separation model based on an auditory modulation mechanism. We compute the modulation signal by frequency-band division and envelope detection, and extract the modulation amplitude spectrum by Fourier transform. Embedding features of the modulation amplitude spectrum are extracted by a BiLSTM combined with a self-attention mechanism. The self-organizing map algorithm clusters the extracted features to obtain the mask matrices of the different speakers, from which the speech signals are reconstructed. Experimental results show that the proposed model achieves PESQ and SDR values of 3.35 and 9.41 dB on the WSJ0-2mix dataset, improvements of 4.36% and 8.79% over the current state-of-the-art method.
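The front end summarized above (frequency-band division, envelope detection, then a Fourier transform of the envelope) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Butterworth filterbank, the band edges, and the frame/hop sizes are all assumptions chosen for the example.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_amplitude_spectrum(x, fs, band_edges, frame_len, hop):
    """Band-pass filterbank -> Hilbert envelope -> framewise FFT magnitude.

    Returns an array of shape (n_bands, n_frames, n_modulation_freqs).
    """
    envelopes = []
    for lo, hi in band_edges:
        # Frequency-band division (assumed 4th-order Butterworth band-pass).
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        subband = sosfiltfilt(sos, x)
        # Envelope detection via the analytic-signal magnitude.
        envelopes.append(np.abs(hilbert(subband)))
    env = np.stack(envelopes)                       # (n_bands, n_samples)

    n_frames = 1 + (env.shape[1] - frame_len) // hop
    win = np.hanning(frame_len)
    spec = np.empty((env.shape[0], n_frames, frame_len // 2 + 1))
    for t in range(n_frames):
        seg = env[:, t * hop : t * hop + frame_len] * win
        # Fourier transform of the envelope gives the modulation spectrum.
        spec[:, t] = np.abs(np.fft.rfft(seg, axis=-1))
    return spec

# Toy input: a 440 Hz carrier amplitude-modulated at 4 Hz, 1 s at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
edges = [(100, 400), (400, 800), (800, 1600), (1600, 3200)]
mas = modulation_amplitude_spectrum(x, fs, edges, frame_len=512, hop=256)
print(mas.shape)  # (4, 61, 257)
```

In the model itself, these modulation amplitude spectra are the features fed to the BiLSTM with self-attention; the hypothetical parameters here only illustrate the signal path.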