Looking Enhances Listening: Recovering Missing Speech Using Images (Multimedia)

  • February 15, 2020
  • Notes

Speech is better understood with visual context; for this reason, there have been many attempts to adapt automatic speech recognition (ASR) systems using images. However, prior work has shown that visually adapted ASR models use images only as a regularization signal while completely ignoring their semantic content. In this paper, we present a set of experiments demonstrating the utility of the visual modality under noisy conditions. The results show that multimodal ASR models can recover words that are masked in the input acoustic signal by grounding their transcriptions in the visual representations. We observe that integrating visual context can yield up to a 35% relative improvement in masked word recovery. These results demonstrate that end-to-end multimodal ASR systems can become more robust to noise by leveraging visual context.
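The paper itself does not spell out its architecture in this note, so the following is only a minimal sketch of the general idea: an end-to-end ASR model whose decoder also sees a global image embedding, so that when part of the acoustic signal is masked the visual semantics can help recover the missing word. The module names, feature dimensions, and the simple additive fusion are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch (assumed architecture, not the paper's exact one) of fusing
# visual context into an end-to-end ASR model with PyTorch.
import torch
import torch.nn as nn


class MultimodalASR(nn.Module):
    def __init__(self, n_mels=80, d_model=256, d_image=2048, vocab_size=1000):
        super().__init__()
        # Acoustic encoder: log-Mel frames -> hidden states.
        self.audio_encoder = nn.LSTM(n_mels, d_model, batch_first=True)
        # Project a global image embedding (e.g. from a pretrained CNN)
        # into the same space as the acoustic states.
        self.image_proj = nn.Linear(d_image, d_model)
        # Decoder over the fused audio-visual representation.
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.output = nn.Linear(d_model, vocab_size)

    def forward(self, audio, image):
        # audio: (batch, frames, n_mels); image: (batch, d_image)
        enc, _ = self.audio_encoder(audio)
        # Add the projected image vector to every acoustic state, so the
        # decoder can fall back on visual semantics where the acoustic
        # evidence for a word has been masked out.
        fused = enc + self.image_proj(image).unsqueeze(1)
        dec, _ = self.decoder(fused)
        return self.output(dec)  # (batch, frames, vocab_size) logits


# Toy usage: zero out a stretch of frames to simulate a masked word.
model = MultimodalASR()
audio = torch.randn(2, 120, 80)
audio[:, 40:60, :] = 0.0      # masked region in the acoustic signal
image = torch.randn(2, 2048)  # global image feature
logits = model(audio, image)
print(logits.shape)           # torch.Size([2, 120, 1000])
```

The design choice illustrated here is that the image contributes semantic evidence at every time step rather than acting only as a regularizer, which is the distinction the paper's experiments probe.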

Original title: LOOKING ENHANCES LISTENING: RECOVERING MISSING SPEECH USING IMAGES

Original abstract: Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems. Current work, however, has shown that visually adapted ASR models only use images as a regularization signal, while completely ignoring their semantic content. In this paper, we present a set of experiments where we show the utility of the visual modality under noisy conditions. Our results show that multimodal ASR models can recover words which are masked in the input acoustic signal, by grounding its transcriptions using the visual representations. We observe that integrating visual context can result in up to 35% relative improvement in masked word recovery. These results demonstrate that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context.

Original authors: Tejas Srinivasan, Ramon Sanabria, Florian Metze

Original link: https://arxiv.org/abs/2002.05639