Looking Enhances Listening: Recovering Missing Speech Using Images (Multimedia)

  • February 15, 2020
  • Notes

Speech is better understood with visual context; for this reason, there have been many attempts to adapt automatic speech recognition (ASR) systems using images. However, prior work has shown that visually adapted ASR models use images only as a regularization signal while completely ignoring their semantic content. In this paper, we present a set of experiments demonstrating the utility of the visual modality under noisy conditions. The results show that multimodal ASR models can recover words that are masked in the input acoustic signal by grounding their transcriptions in the visual representations. We observe that integrating visual context can yield up to a 35% relative improvement in masked word recovery. These results demonstrate that end-to-end multimodal ASR systems can become more robust to noise by leveraging visual context.
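The paper itself does not spell out its architecture in this note, so the following is only a minimal sketch of the general idea: an end-to-end ASR model whose decoder also sees a global image embedding, so that when part of the acoustic signal is masked the visual semantics can help recover the missing word. The module names, feature dimensions, and the simple additive fusion are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch (assumed architecture, not the paper's exact one) of fusing
# visual context into an end-to-end ASR model with PyTorch.
import torch
import torch.nn as nn


class MultimodalASR(nn.Module):
    def __init__(self, n_mels=80, d_model=256, d_image=2048, vocab_size=1000):
        super().__init__()
        # Acoustic encoder: log-Mel frames -> hidden states.
        self.audio_encoder = nn.LSTM(n_mels, d_model, batch_first=True)
        # Project a global image embedding (e.g. from a pretrained CNN)
        # into the same space as the acoustic states.
        self.image_proj = nn.Linear(d_image, d_model)
        # Decoder over the fused audio-visual representation.
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.output = nn.Linear(d_model, vocab_size)

    def forward(self, audio, image):
        # audio: (batch, frames, n_mels); image: (batch, d_image)
        enc, _ = self.audio_encoder(audio)
        # Add the projected image vector to every acoustic state, so the
        # decoder can fall back on visual semantics where the acoustic
        # evidence for a word has been masked out.
        fused = enc + self.image_proj(image).unsqueeze(1)
        dec, _ = self.decoder(fused)
        return self.output(dec)  # (batch, frames, vocab_size) logits


# Toy usage: zero out a stretch of frames to simulate a masked word.
model = MultimodalASR()
audio = torch.randn(2, 120, 80)
audio[:, 40:60, :] = 0.0      # masked region in the acoustic signal
image = torch.randn(2, 2048)  # global image feature
logits = model(audio, image)
print(logits.shape)           # torch.Size([2, 120, 1000])
```

The design choice illustrated here is that the image contributes semantic evidence at every time step rather than acting only as a regularizer, which is the distinction the paper's experiments probe.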

Original title: LOOKING ENHANCES LISTENING: RECOVERING MISSING SPEECH USING IMAGES

Original abstract: Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems. Current work, however, has shown that visually adapted ASR models only use images as a regularization signal, while completely ignoring their semantic content. In this paper, we present a set of experiments where we show the utility of the visual modality under noisy conditions. Our results show that multimodal ASR models can recover words which are masked in the input acoustic signal, by grounding its transcriptions using the visual representations. We observe that integrating visual context can result in up to 35% relative improvement in masked word recovery. These results demonstrate that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context.

Original authors: Tejas Srinivasan, Ramon Sanabria, Florian Metze

Original link: https://arxiv.org/abs/2002.05639