Looking Enhances Listening: Recovering Missing Speech Using Images (Multimedia)

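Speech is easier to understand when visual context is available, which has motivated many attempts to adapt automatic speech recognition (ASR) systems using images. Recent work, however, has shown that visually adapted ASR models use images only as a regularization signal and completely ignore their semantic content. This paper presents a set of experiments demonstrating the utility of the visual modality under noisy conditions. The results show that multimodal ASR models can recover words that are masked in the input acoustic signal by grounding their transcriptions in the visual representations. Integrating visual context yields up to a 35% relative improvement in masked word recovery. These results indicate that end-to-end multimodal ASR systems can become more robust to noise by leveraging visual context.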
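The 35% figure is a relative improvement, not an absolute one, and the two are easy to confuse. The sketch below shows how a relative improvement in masked word recovery could be computed from two recovery rates. The function name and the example rates are hypothetical illustrations, not values or code from the paper.

```python
def relative_improvement(baseline_rate: float, multimodal_rate: float) -> float:
    """Relative gain of the multimodal recovery rate over the baseline rate.

    Both arguments are fractions of masked words that were recovered (0 to 1).
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (multimodal_rate - baseline_rate) / baseline_rate


# Hypothetical rates chosen only for illustration (not reported in the paper):
# a speech-only model recovers 20% of masked words, a multimodal model 27%.
baseline = 0.20
multimodal = 0.27

gain = relative_improvement(baseline, multimodal)
print(f"absolute gain: {multimodal - baseline:.0%}, relative gain: {gain:.0%}")
```

With these made-up rates, an absolute gain of 7 percentage points corresponds to a 35% relative gain, which is why the two numbers should not be read interchangeably.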

Original title: LOOKING ENHANCES LISTENING: RECOVERING MISSING SPEECH USING IMAGES

Original abstract: Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems. Current work, however, has shown that visually adapted ASR models only use images as a regularization signal, while completely ignoring their semantic content. In this paper, we present a set of experiments where we show the utility of the visual modality under noisy conditions. Our results show that multimodal ASR models can recover words which are masked in the input acoustic signal, by grounding its transcriptions using the visual representations. We observe that integrating visual context can result in up to 35% relative improvement in masked word recovery. These results demonstrate that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context.

Authors: Tejas Srinivasan, Ramon Sanabria, Florian Metze

Link: https://arxiv.org/abs/2002.05639