Abstract:
Single-channel speech separation methods that take the spectrogram as the input feature separate poorly when the time-frequency (TF) bins of different speakers overlap. To address this problem, we propose a deep clustering speech separation model based on an auditory modulation mechanism. We compute the modulation signal by frequency-band division and envelope detection, and extract the modulation amplitude spectrum by Fourier transform. Embedding features of the modulation amplitude spectrum are extracted by a BiLSTM combined with a self-attention mechanism. The self-organizing map algorithm clusters the extracted features to obtain the mask matrices of the different speakers, from which the speech signals are reconstructed. Experimental results show that the proposed model achieves PESQ and SDR values of 3.35 and 9.41 dB on the WSJ0-2mix dataset, improvements of 4.36% and 8.79% over the current state-of-the-art method.
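The front end summarized above (frequency-band division, envelope detection, then a Fourier transform of the envelope) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Butterworth filterbank, the band edges, and the frame/hop sizes are all assumptions chosen for the example.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_amplitude_spectrum(x, fs, band_edges, frame_len, hop):
    """Band-pass filterbank -> Hilbert envelope -> framewise FFT magnitude.

    Returns an array of shape (n_bands, n_frames, n_modulation_freqs).
    """
    envelopes = []
    for lo, hi in band_edges:
        # Frequency-band division (assumed 4th-order Butterworth band-pass).
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        subband = sosfiltfilt(sos, x)
        # Envelope detection via the analytic-signal magnitude.
        envelopes.append(np.abs(hilbert(subband)))
    env = np.stack(envelopes)                       # (n_bands, n_samples)

    n_frames = 1 + (env.shape[1] - frame_len) // hop
    win = np.hanning(frame_len)
    spec = np.empty((env.shape[0], n_frames, frame_len // 2 + 1))
    for t in range(n_frames):
        seg = env[:, t * hop : t * hop + frame_len] * win
        # Fourier transform of the envelope gives the modulation spectrum.
        spec[:, t] = np.abs(np.fft.rfft(seg, axis=-1))
    return spec

# Toy input: a 440 Hz carrier amplitude-modulated at 4 Hz, 1 s at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
edges = [(100, 400), (400, 800), (800, 1600), (1600, 3200)]
mas = modulation_amplitude_spectrum(x, fs, edges, frame_len=512, hop=256)
print(mas.shape)  # (4, 61, 257)
```

In the model itself, these modulation amplitude spectra are the features fed to the BiLSTM with self-attention; the hypothetical parameters here only illustrate the signal path.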