What Does My QA Model Know? Devising Controlled Probes Using Expert Knowledge (CS, Computation and Language)

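Open-domain question answering (QA) involves a number of underlying knowledge and reasoning challenges, but do models actually learn this knowledge when they are trained on benchmark tasks? To investigate, we introduce several new challenge tasks that probe whether state-of-the-art QA models have general knowledge about word definitions and general taxonomic reasoning, both of which underlie more complex reasoning and appear widely in benchmark datasets. As an alternative to expensive crowdsourcing, we introduce a methodology for automatically building datasets from various kinds of expert knowledge (e.g., knowledge graphs and lexical taxonomies), which gives systematic control over the resulting probes and allows a more comprehensive evaluation. We find that automatically constructed probes are vulnerable to annotation artifacts, which we carefully control for. Our evaluation confirms that transformer-based QA models are already predisposed to recognize certain types of structural lexical knowledge. It also reveals a more nuanced picture: performance degrades substantially when the number of hops in the underlying taxonomic hierarchy increases even slightly, or when more challenging distractor candidate answers are introduced. Moreover, even when these models succeed at standard instance-level evaluation, they leave considerable room for improvement when evaluated at the level of clusters of semantically connected probes (e.g., all Isa questions about a concept).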

Original title: What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge

Original abstract: Open-domain question answering (QA) is known to involve several underlying knowledge and reasoning challenges, but are models actually learning such knowledge when trained on benchmark tasks? To investigate this, we introduce several new challenge tasks that probe whether state-of-the-art QA models have general knowledge about word definitions and general taxonomic reasoning, both of which are fundamental to more complex forms of reasoning and are widespread in benchmark datasets. As an alternative to expensive crowd-sourcing, we introduce a methodology for automatically building datasets from various types of expert knowledge (e.g., knowledge graphs and lexical taxonomies), allowing for systematic control over the resulting probes and for a more comprehensive evaluation. We find automatically constructing probes to be vulnerable to annotation artifacts, which we carefully control for. Our evaluation confirms that transformer-based QA models are already predisposed to recognize certain types of structural lexical knowledge. However, it also reveals a more nuanced picture: their performance degrades substantially with even a slight increase in the number of hops in the underlying taxonomic hierarchy, or as more challenging distractor candidate answers are introduced. Further, even when these models succeed at the standard instance-level evaluation, they leave much room for improvement when assessed at the level of clusters of semantically connected probes (e.g., all Isa questions about a concept).
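The gap between instance-level and cluster-level evaluation can be illustrated with a small sketch. It assumes one strict reading of cluster-level scoring, in which a cluster counts as correct only if every probe in it is answered correctly; the abstract does not spell out the exact definition. The function names and the sample data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def instance_accuracy(results):
    """Fraction of individual probes answered correctly."""
    return sum(correct for _, correct in results) / len(results)


def cluster_accuracy(results):
    """Fraction of clusters in which every probe is answered correctly.

    Assumption: a strict all-or-nothing criterion per cluster.
    """
    clusters = defaultdict(list)
    for cluster_id, correct in results:
        clusters[cluster_id].append(correct)
    return sum(all(flags) for flags in clusters.values()) / len(clusters)


# Hypothetical (concept, was the model correct?) pairs, one per probe.
# These values are made up for illustration; they are not results from the paper.
results = [
    ("dog", True), ("dog", True), ("dog", False),
    ("oak", True), ("oak", True),
    ("piano", True), ("piano", False),
]

print(f"instance-level accuracy: {instance_accuracy(results):.3f}")
print(f"cluster-level accuracy:  {cluster_accuracy(results):.3f}")
```

With this made-up data, five of seven probes are correct, but only one of the three concepts has all of its probes answered correctly, so the cluster-level score is much lower than the instance-level score.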

Authors: Kyle Richardson, Ashish Sabharwal

Original URL: https://arxiv.org/abs/1912.13337