Streaming automatic speech recognition with the transformer model (cs.SD)

  • January 14, 2020
  • Notes

Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recent work has shown that the transformer architecture, which uses self-attention to model temporal context information, achieves significantly lower word error rates (WERs) than recurrent neural network (RNN) based system architectures. Despite this success, its practical use has been limited to offline ASR tasks, since encoder-decoder architectures typically require an entire speech utterance as input. This work proposes a transformer-based end-to-end ASR system for streaming ASR, in which an output must be generated shortly after each spoken word. To achieve this, the authors apply time-restricted self-attention in the encoder and a triggered attention mechanism for the encoder-decoder attention. The proposed streaming transformer architecture achieves 2.7% and 7.0% WER on the "clean" and "other" test sets of LibriSpeech, which to the authors' knowledge is the best published streaming end-to-end ASR result on this task.
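The two mechanisms above can be sketched with plain masked attention. This is a minimal single-head numpy illustration, not the paper's implementation: the window sizes, the `lookahead` parameter, and the trigger frames (which in the paper come from CTC spikes) are all hypothetical values chosen for the example.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax; -inf entries get exactly zero attention weight.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def time_restricted_attention(x, left=8, right=2):
    """Encoder self-attention where each frame attends only to `left` past
    and `right` future frames, bounding the encoder's lookahead latency."""
    T, d = x.shape
    scores = (x @ x.T) / np.sqrt(d)
    t = np.arange(T)
    # Mask every key position outside the window [t - left, t + right].
    outside = (t[None, :] < t[:, None] - left) | (t[None, :] > t[:, None] + right)
    scores[outside] = -np.inf
    return softmax(scores) @ x

def triggered_attention(dec_q, enc, triggers, lookahead=1):
    """Encoder-decoder attention where output step n may only attend to
    encoder frames up to triggers[n] + lookahead. In the paper the trigger
    frames are derived from CTC spikes; here they are given explicitly."""
    N, d = dec_q.shape
    T = enc.shape[0]
    scores = (dec_q @ enc.T) / np.sqrt(d)
    t = np.arange(T)
    for n in range(N):
        scores[n, t > triggers[n] + lookahead] = -np.inf
    return softmax(scores) @ enc
```

With `left=0, right=0` each frame can attend only to itself, so the encoder output equals its input; widening the window trades latency for more temporal context.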

Original title: Streaming automatic speech recognition with the transformer model

Original abstract: Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context information, has been shown to achieve significantly lower word error rates (WERs) compared to recurrent neural network (RNN) based system architectures. Despite its success, the practical usage is limited to offline ASR tasks, since encoder-decoder architectures typically require an entire speech utterance as input. In this work, we propose a transformer based end-to-end ASR system for streaming ASR, where an output must be generated shortly after each spoken word. To achieve this, we apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism. Our proposed streaming transformer architecture achieves 2.7% and 7.0% WER for the "clean" and "other" test data of LibriSpeech, which to our knowledge is the best published streaming end-to-end ASR result for this task.

Original authors: Niko Moritz, Takaaki Hori, Jonathan Le Roux

Original link: https://arxiv.org/abs/2001.02674