Self-Supervised Learning for Audio-Visual Speaker Diarization (Multimedia)

  • February 15, 2020
  • Notes

Speaker diarization, the task of finding the speech segments of specific speakers, is widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, the authors propose a self-supervised audio-video synchronization learning method that addresses speaker diarization without massive labeling effort. They improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. Tested on a real-world human-computer interaction system, their best model yields a remarkable gain of +8% in F1-score as well as a reduction in diarization error rate. Finally, they introduce a new large-scale audio-video corpus designed to fill the vacancy of audio-video datasets in Chinese.
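The abstract names two training objectives, the dynamic triplet loss and the multinomial loss, without giving their formulas. For orientation, here is a minimal NumPy sketch of the standard triplet loss that the paper's dynamic variant presumably builds on: it pulls a synchronized audio-visual pair (anchor, positive) together and pushes a mismatched pair (anchor, negative) at least a margin further apart. The 128-dimensional embeddings and the margin value are illustrative assumptions, not details from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss (not the paper's dynamic variant):
    encourage d(anchor, positive) + margin <= d(anchor, negative),
    using squared Euclidean distance between embeddings."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# Toy usage with hypothetical 128-dim embeddings for a batch of 4 segments.
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 128))  # audio embedding of a speech segment
p = rng.normal(size=(4, 128))  # face embedding, same speaker and time (synced)
n = rng.normal(size=(4, 128))  # face embedding, different speaker or time
print(triplet_loss(a, p, n))
```

In the audio-visual synchronization setting, such a loss lets the model learn which face is speaking from the data itself, which is what removes the need for manual diarization labels.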

Original title: SELF-SUPERVISED LEARNING FOR AUDIO-VISUAL SPEAKER DIARIZATION

Original abstract: Speaker diarization, which is to find the speech segments of specific speakers, has been widely used in human-centered applications such as video conferences or human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method to address the problem of speaker diarization without massive labeling effort. We improve the previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We test them on a real-world human-computer interaction system and the results show our best model yields a remarkable gain of +8% F1-scores as well as diarization error rate reduction. Finally, we introduce a new large scale audio-video corpus designed to fill the vacancy of audio-video dataset in Chinese.

Original authors: Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang

Original link: https://arxiv.org/abs/2002.05314