What Does My QA Model Know? Devising Controlled Probes Using Expert Knowledge (CS CompLang)
- January 3, 2020
- Notes
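Open-domain question answering (QA) involves several underlying knowledge and reasoning challenges, but do models actually learn this knowledge when trained on benchmark tasks? To investigate, we introduce several new challenge tasks that probe whether state-of-the-art QA models have general knowledge of word definitions and general taxonomic reasoning. Both are fundamental to more complex reasoning and are widespread in benchmark datasets. As an alternative to expensive crowdsourcing, we present a method for automatically building datasets from various kinds of expert knowledge (for example, knowledge graphs and lexical taxonomies), which gives systematic control over the resulting probes and allows a more comprehensive evaluation. We find that automatically constructed probes are vulnerable to annotation artifacts, which we carefully control for. Our evaluation confirms that transformer-based QA models are already predisposed to recognize certain types of structural lexical knowledge. It also reveals a more nuanced picture: their performance degrades substantially with even a slight increase in the number of hops in the underlying taxonomic hierarchy, or as more challenging distractor answer candidates are introduced. Moreover, even when these models succeed at standard instance-level evaluation, they leave considerable room for improvement when assessed at the level of clusters of semantically connected probes (for example, all Isa questions about a concept).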
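To make the idea of automatically built taxonomic probes concrete, here is a minimal sketch. It is not the authors' pipeline: the toy hypernym map, the question template, the random distractor sampling, and the names `HYPERNYM`, `ancestors`, and `build_isa_probes` are all assumptions made for illustration. It only shows how each probe can carry a hop count and a controlled set of distractors.

```python
import random

# Toy hypernym map (child -> parent). Hypothetical stand-in for a real
# lexical taxonomy; not data from the paper.
HYPERNYM = {
    "poodle": "dog",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
    "oak": "tree",
    "tree": "plant",
}


def ancestors(concept):
    """Return [(ancestor, hops)] by walking up the hypernym chain."""
    chain, hops = [], 0
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        hops += 1
        chain.append((concept, hops))
    return chain


def build_isa_probes(num_distractors=3, seed=0):
    """Build one multiple-choice probe per (concept, ancestor) pair."""
    rng = random.Random(seed)
    vocabulary = sorted(set(HYPERNYM) | set(HYPERNYM.values()))
    probes = []
    for concept in sorted(HYPERNYM):
        chain = ancestors(concept)
        # Distractors must be neither the concept nor any true ancestor.
        excluded = {concept} | {name for name, _ in chain}
        pool = [word for word in vocabulary if word not in excluded]
        for answer, hops in chain:
            distractors = rng.sample(pool, min(num_distractors, len(pool)))
            choices = distractors + [answer]
            rng.shuffle(choices)
            probes.append({
                "concept": concept,
                "question": f"'{concept}' is a type of what?",
                "choices": choices,
                "answer": answer,
                "hops": hops,
            })
    return probes


probes = build_isa_probes()
```

Grouping the resulting probes by their `hops` field is one way to check how accuracy changes as the number of hops grows. Sampling distractors from closer neighbors in the taxonomy, instead of at random, would be one way to make them harder.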
Original title: What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge
Original abstract: Open-domain question answering (QA) is known to involve several underlying knowledge and reasoning challenges, but are models actually learning such knowledge when trained on benchmark tasks? To investigate this, we introduce several new challenge tasks that probe whether state-of-the-art QA models have general knowledge about word definitions and general taxonomic reasoning, both of which are fundamental to more complex forms of reasoning and are widespread in benchmark datasets. As an alternative to expensive crowd-sourcing, we introduce a methodology for automatically building datasets from various types of expert knowledge (e.g., knowledge graphs and lexical taxonomies), allowing for systematic control over the resulting probes and for a more comprehensive evaluation. We find automatically constructing probes to be vulnerable to annotation artifacts, which we carefully control for. Our evaluation confirms that transformer-based QA models are already predisposed to recognize certain types of structural lexical knowledge. However, it also reveals a more nuanced picture: their performance degrades substantially with even a slight increase in the number of hops in the underlying taxonomic hierarchy, or as more challenging distractor candidate answers are introduced. Further, even when these models succeed at the standard instance-level evaluation, they leave much room for improvement when assessed at the level of clusters of semantically connected probes (e.g., all Isa questions about a concept).
Authors: Kyle Richardson, Ashish Sabharwal
Source: https://arxiv.org/abs/1912.13337
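The contrast between instance-level and cluster-level evaluation in the abstract can be illustrated with a small sketch. The abstract does not spell out the metric, so this assumes one strict reading: a cluster (all probes about one concept) counts as correct only if every probe in it is answered correctly. The function names and the sample results are hypothetical.

```python
from collections import defaultdict


def instance_accuracy(results):
    """Fraction of individual probes answered correctly."""
    return sum(r["correct"] for r in results) / len(results)


def cluster_accuracy(results):
    """Fraction of concepts whose probes were ALL answered correctly.

    Assumed strict definition; the paper's exact metric may differ.
    """
    clusters = defaultdict(list)
    for r in results:
        clusters[r["concept"]].append(r["correct"])
    return sum(all(flags) for flags in clusters.values()) / len(clusters)


# Made-up model outcomes, for illustration only.
results = [
    {"concept": "dog", "correct": True},
    {"concept": "dog", "correct": True},
    {"concept": "cat", "correct": True},
    {"concept": "cat", "correct": False},
    {"concept": "oak", "correct": True},
    {"concept": "oak", "correct": False},
]

print(f"instance-level: {instance_accuracy(results):.2f}")
print(f"cluster-level:  {cluster_accuracy(results):.2f}")
```

In this made-up example, four of six probes are correct, but only one of three concepts has all of its probes correct, so the cluster-level score is lower than the instance-level one.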