Wavesplit: End-to-End Speech Separation by Speaker Clustering (cs.SD)
- March 17, 2020
- Notes
We introduce Wavesplit, an end-to-end speech separation system. From a single recording of mixed speech, the model infers and clusters a representation of each speaker, and then estimates each source signal conditioned on the inferred representations. The model is trained on raw waveforms to perform both tasks jointly. By deriving a set of speaker representations through clustering, the model addresses the fundamental permutation problem of speech separation. Moreover, compared with previous approaches, the sequence-wide speaker representations provide more robust separation of long, challenging sequences. Wavesplit outperforms the previous state of the art on clean mixtures of 2 or 3 speakers (WSJ0-2mix, WSJ0-3mix), as well as under noisy (WHAM!) and reverberant (WHAMR!) conditions. As an additional contribution, the authors further improve the model by introducing online data augmentation for separation.
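To make the two-stage design above concrete, here is a minimal PyTorch sketch of the idea: a speaker stack maps the raw mixture to per-frame speaker embeddings, those embeddings are aggregated into one vector per speaker (a plain time average stands in for the clustering step described in the paper), and a separation stack estimates each source waveform conditioned on its speaker vector. All module names, layer sizes, and the conditioning scheme here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage idea (speaker stack + separation stack).
# Layer sizes and conditioning are assumptions for illustration only.
import torch
import torch.nn as nn


class SpeakerStack(nn.Module):
    """Maps a raw-waveform mixture to one embedding per speaker slot per frame."""

    def __init__(self, n_speakers=2, channels=64, emb_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # One embedding head per target speaker slot.
        self.heads = nn.Conv1d(channels, n_speakers * emb_dim, kernel_size=1)
        self.n_speakers, self.emb_dim = n_speakers, emb_dim

    def forward(self, mix):                       # mix: (batch, samples)
        h = self.encoder(mix.unsqueeze(1))        # (batch, channels, frames)
        e = self.heads(h)                         # (batch, n_spk * emb_dim, frames)
        return e.view(mix.size(0), self.n_speakers, self.emb_dim, -1)


class SeparationStack(nn.Module):
    """Estimates each source waveform, conditioned on one vector per speaker."""

    def __init__(self, channels=64, emb_dim=128):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel_size=16, stride=8, padding=4)
        self.cond = nn.Linear(emb_dim, channels)   # additive (FiLM-like) conditioning
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel_size=16, stride=8, padding=4)

    def forward(self, mix, spk_vectors):           # spk_vectors: (batch, n_spk, emb_dim)
        h = self.encoder(mix.unsqueeze(1))         # (batch, channels, frames)
        estimates = []
        for s in range(spk_vectors.size(1)):
            bias = self.cond(spk_vectors[:, s]).unsqueeze(-1)   # (batch, channels, 1)
            estimates.append(self.decoder(torch.relu(h + bias)).squeeze(1))
        return torch.stack(estimates, dim=1)       # (batch, n_spk, samples)


mix = torch.randn(4, 8000)                         # 4 one-second mixtures at 8 kHz
speaker_stack, separation_stack = SpeakerStack(), SeparationStack()
frame_embeddings = speaker_stack(mix)              # (4, 2, 128, frames)
# The paper aggregates per-frame embeddings over the whole sequence via clustering;
# a simple time average is used here as a stand-in for that step.
speaker_vectors = frame_embeddings.mean(dim=-1)    # (4, 2, 128)
estimated_sources = separation_stack(mix, speaker_vectors)
print(estimated_sources.shape)                     # torch.Size([4, 2, 8000])
```

Because the speaker vectors are computed over the whole sequence, the ordering of the output channels is tied to speaker identity rather than to an arbitrary frame-level assignment, which is how the approach sidesteps the permutation problem.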
Original title: Wavesplit: End-to-End Speech Separation by Speaker Clustering
Original abstract: We introduce Wavesplit, an end-to-end speech separation system. From a single recording of mixed speech, the model infers and clusters representations of each speaker and then estimates each source signal conditioned on the inferred representations. The model is trained on the raw waveform to jointly perform the two tasks. Our model infers a set of speaker representations through clustering, which addresses the fundamental permutation problem of speech separation. Moreover, the sequence-wide speaker representations provide a more robust separation of long, challenging sequences, compared to previous approaches. We show that Wavesplit outperforms the previous state-of-the-art on clean mixtures of 2 or 3 speakers (WSJ0-2mix, WSJ0-3mix), as well as in noisy (WHAM!) and reverberated (WHAMR!) conditions. As an additional contribution, we further improve our model by introducing online data augmentation for separation.
Original authors: Neil Zeghidour, David Grangier. Original link: http://cn.arxiv.org/abs/2002.08933
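The "online data augmentation" mentioned at the end of the abstract amounts to creating new training mixtures on the fly instead of training on a fixed, precomputed set. Below is a minimal sketch of that general idea, assuming single-speaker utterances are drawn at random, cropped or padded to a common length, and summed with random gains; the function name `dynamic_mixture` and all sampling details are hypothetical, not the paper's exact recipe.

```python
# Hypothetical sketch of on-the-fly mixture creation for separation training.
import random
import torch
import torch.nn.functional as F


def dynamic_mixture(utterances, n_speakers=2, length=32000, gain_db_range=(-5.0, 5.0)):
    """Builds a fresh (mixture, sources) training pair from single-speaker recordings."""
    sources = []
    for wav in random.sample(utterances, n_speakers):
        # Crop or zero-pad each source to a common length.
        if wav.numel() >= length:
            start = random.randint(0, wav.numel() - length)
            wav = wav[start:start + length]
        else:
            wav = F.pad(wav, (0, length - wav.numel()))
        # Apply a random gain so the model sees new relative levels each time.
        gain = 10.0 ** (random.uniform(*gain_db_range) / 20.0)
        sources.append(gain * wav)
    sources = torch.stack(sources)           # (n_speakers, length)
    return sources.sum(dim=0), sources       # mixture and its ground-truth sources


# Usage: draw a brand-new pair at every training step instead of reusing fixed mixtures.
bank = [torch.randn(48000), torch.randn(20000), torch.randn(40000)]
mix, srcs = dynamic_mixture(bank)
print(mix.shape, srcs.shape)                 # torch.Size([32000]) torch.Size([2, 32000])
```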