Deliberation Model Based Two-Pass End-to-End Speech Recognition (CS SC)

  • March 27, 2020
  • Notes

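End-to-end (E2E) models have advanced rapidly in automatic speech recognition (ASR) and are now competitive with conventional models. To improve quality further, earlier work proposed a two-pass model in which a non-streaming Listen, Attend and Spell (LAS) model rescores the hypotheses of a streaming first pass while keeping latency reasonable. That rescorer attends to the acoustics, unlike neural correction models, which see only the first-pass text hypotheses. This paper proposes a deliberation network that attends to both the acoustics and the first-pass hypotheses, with a bidirectional encoder extracting context information from the hypotheses. On Google Voice Search (VS) tasks, the deliberation model achieves a 12% relative WER reduction compared to LAS rescoring, and a 23% reduction on a proper noun test set. The best model also performs 21% relatively better on VS than a large conventional model. The gain has a computational cost: the deliberation decoder is larger than the LAS decoder, so second-pass decoding requires more computation.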
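The core idea, a second-pass decoder that attends to two sources at once, can be illustrated with a small sketch. This is not the paper's implementation. The abstract does not say which attention mechanism is used or how the two results are combined, so plain dot-product attention and concatenation of the two context vectors are assumptions here. All function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Dot-product attention: returns (context vector, weights) over `memory`."""
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    context = [sum(w * vec[i] for w, vec in zip(weights, memory)) for i in range(dim)]
    return context, weights


def deliberation_context(decoder_state, acoustic_enc, hypothesis_enc):
    """Attend to both sources with the same query, then combine the results.

    Assumption: the two context vectors are concatenated before being fed
    to the decoder; the abstract does not specify this.
    """
    acoustic_ctx, _ = attend(decoder_state, acoustic_enc)
    hypothesis_ctx, _ = attend(decoder_state, hypothesis_enc)
    return acoustic_ctx + hypothesis_ctx  # list concatenation


# Toy inputs, made up for illustration only.
acoustic_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 encoded audio frames
hypothesis_enc = [[0.5, 0.5], [2.0, 0.0]]            # 2 encoded first-pass tokens
decoder_state = [1.0, 0.0]                           # current decoder query

context = deliberation_context(decoder_state, acoustic_enc, hypothesis_enc)
print(len(context))  # 4: two 2-dimensional context vectors, concatenated
```

A rescorer in the LAS style would compute only `acoustic_ctx`, and a text-only correction model would compute only `hypothesis_ctx`. The deliberation setup gives the decoder both at every step.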

Original title: Deliberation Model Based Two-Pass End-to-End Speech Recognition

Original abstract: End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency. The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network. A bidirectional encoder is used to extract context information from first-pass hypotheses. The proposed deliberation model achieves 12% relative WER reduction compared to LAS rescoring in Google Voice Search (VS) tasks, and 23% reduction on a proper noun test set. Compared to a large conventional model, our best model performs 21% relatively better for VS. In terms of computational complexity, the deliberation decoder has a larger size than the LAS decoder, and hence requires more computations in second-pass decoding.

Authors: Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prabhavalkar

Original URL: https://arxiv.org/abs/2003.07962