
Investigating Language Impact in Bilingual Approaches for Computational Language Documentation

Data collection campaigns for endangered languages must contend with the fact that many of these languages come from oral traditions and that producing transcriptions is costly. To keep the recordings interpretable, it is therefore essential to at least translate them into a widely spoken language. In this paper we investigate how the choice of translation language affects subsequent documentation work and the automatic approaches that may be applied to the resulting bilingual corpus. To answer this question, we use the MaSS multilingual speech corpus (Boito et al., 2020) to create 56 bilingual pairs, which we apply to the task of low-resource unsupervised word segmentation and alignment. Our results show that the choice of translation language influences word segmentation performance, and that different lexicons are learned when different aligned translations are used. Lastly, this paper proposes a hybrid approach to bilingual word segmentation that combines boundary clues extracted from a non-parametric Bayesian model (Goldwater et al., 2009a) with the attentional word segmentation neural model of Godard et al. (2018). Our results suggest that incorporating these clues into the neural models' input representation improves their translation and alignment quality, especially for challenging language pairs.
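The figure of 56 bilingual pairs is consistent with taking each language in the corpus in turn as the language to be segmented and pairing it with every other language as the translation. The sketch below illustrates that count in Python. It rests on two assumptions that the abstract does not state: that MaSS covers the eight languages listed, and that the pairs are ordered, so that (French, English) and (English, French) count separately. The names `MASS_LANGUAGES` and `bilingual_pairs` are hypothetical.

```python
from itertools import permutations

# Assumption: the eight languages covered by the MaSS corpus.
MASS_LANGUAGES = [
    "Basque", "English", "Finnish", "French",
    "Hungarian", "Romanian", "Russian", "Spanish",
]


def bilingual_pairs(languages):
    """Return every ordered (source, translation) pair of distinct languages."""
    return list(permutations(languages, 2))


pairs = bilingual_pairs(MASS_LANGUAGES)
print(len(pairs))  # 8 sources x 7 translations each
```

Under this reading, every language appears as the source in seven pairs, which is what makes it possible to compare how the same source is segmented when it is aligned with different translations.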

Authors: Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier

Paper link: https://arxiv.org/abs/2003.13325