Self-Supervised Learning for Audio-Visual Speaker Diarization (Multimedia)

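Speaker diarization, the task of finding the speech segments that belong to specific speakers, is widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that addresses speaker diarization without requiring massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. Tests on a real-world human-computer interaction system show that our best model yields a gain of +8% in F1-score and also reduces the diarization error rate. Finally, we introduce a new large-scale audio-video corpus designed to fill the gap in Chinese audio-video datasets.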

Original title: SELF-SUPERVISED LEARNING FOR AUDIO-VISUAL SPEAKER DIARIZATION

Original abstract: Speaker diarization, which is to find the speech segments of specific speakers, has been widely used in human-centered applications such as video conferences or human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method to address the problem of speaker diarization without massive labeling effort. We improve the previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We test them on a real-world human-computer interaction system and the results show our best model yields a remarkable gain of +8% F1-scores as well as diarization error rate reduction. Finally, we introduce a new large scale audio-video corpus designed to fill the vacancy of audio-video dataset in Chinese.

Original authors: Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang

Original link: https://arxiv.org/abs/2002.05314
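
The abstract names a dynamic triplet loss but does not define it here. As background only, the following is a minimal sketch of the standard triplet margin loss that such synchronization objectives typically build on. It is not the paper's dynamic variant. The embeddings, the `margin` value, and the function names are hypothetical and chosen purely for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings (assumed values, not taken from the paper).
video = [0.0, 0.0]          # anchor: embedding of a video clip
audio_sync = [0.0, 1.0]     # audio aligned with the clip (distance 1)
audio_shifted = [3.0, 4.0]  # audio from a shifted window (distance 5)

# Synchronized audio is already much closer than the shifted audio.
loss_easy = triplet_loss(video, audio_sync, audio_shifted)

# Swapping the roles yields a violated triplet with a positive loss.
loss_hard = triplet_loss(video, audio_shifted, audio_sync)
```

In a self-supervised setting of this kind, the positive and negative pairs come from the temporal alignment of the audio and video streams themselves, so no manual speaker labels are needed to form the triplets.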