Abstract:
Text-independent speaker verification systems perform worse when the test utterance is short. To address this, a method of enhancing acoustic features is proposed. The method uses a seq2seq-based generation model to generate longer acoustic features from short-duration acoustic features. The generation model consists of an encoder that extracts deep features and a decoder that outputs acoustic features. It uses an attention mechanism to capture the relationship between sequences, and a cosine distance loss is added during training to improve its generalization performance. A pre-trained text-independent speaker verification model serves as a component of the training architecture to assist the training of the generation model. Experimental results show that, for speech durations of 1-3 seconds, the equal error rate of the system is reduced by 7.78% on average when the model is used.
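
As an illustration of the cosine distance loss mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes the loss is computed between speaker embeddings of the generated features and of a reference long utterance, both taken from the trained verification model; the abstract does not state which vectors are compared. The function names `cosine_distance` and `mean_cosine_distance_loss` are hypothetical.

```python
import math


def cosine_distance(u, v, eps=1e-12):
    """Return 1 - cos(u, v) for two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return 1.0 - dot / max(norm_u * norm_v, eps)


def mean_cosine_distance_loss(generated, reference):
    """Average cosine distance over a batch of embedding pairs.

    Assumption: `generated` holds embeddings of the generated features
    and `reference` holds embeddings of the matching long utterances.
    """
    if not generated or len(generated) != len(reference):
        raise ValueError("batches must be non-empty and of equal size")
    total = sum(cosine_distance(g, r) for g, r in zip(generated, reference))
    return total / len(generated)
```

The loss is 0 when the two embeddings point in the same direction and grows toward 2 as they point in opposite directions, so minimizing it pulls the embedding of the generated features toward that of the reference utterance regardless of vector magnitude.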