Global-to-Local Neural Networks for Document-Level Relation Extraction: Paper Reading Notes (EMNLP 2020)

January 30, 2021
AI
Natural Language Processing

Background

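Motivation:

The difficulty of document-level relation extraction is that a document contains many entities, and each entity has multiple mentions that appear in different contexts. To identify relations between entities across sentences, an extraction model must capture the complex interactions among the entities in the document and combine the contextual information from all mentions of each entity.
One important point: there is still a big gap between word representations and relation prediction. This has long puzzled me as well: word representations and relation prediction are only loosely connected.
Noise: earlier models aggregate all information indiscriminately (the irrelevant information would be involved as noise and damages the prediction accuracy).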

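Contributions:

There are three questions to consider:

How can the complex semantic information of a document be modeled?
- BERT: captures semantic features and common-sense knowledge.
- A heterogeneous graph, built with heuristic rules, models the semantic interactions among mentions, entities, and sentences.
How can multi-granularity entity representations be learned effectively?
- Global representation layer
  An L-layer R-GCN convolves over the constructed heterogeneous graph to capture each entity's global semantics, yielding the entity's global representation.
- Local representation layer
  The local representation captures the pair-specific semantics of an entity, i.e. how its representation should shift depending on the target entity pair being predicted.
How can the document's topic information be used?
- This point is interesting: the topic of a document can help decide which relations hold.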

Model (the local representation is the novel part)

Encoding Layer

Doc = [w_1, w_2, …, w_k], where w_j is the j^{th} word in the document.

Global Layer

Nodes of every type carry global semantic information.

Construct a global heterogeneous graph, with different types of nodes and edges.

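This idea comes from the paper "Connecting the dots: Document-level neural relation extraction with edge-oriented graphs": connect nodes of different types with edges of different types to capture different dependencies (co-occurrence, coreference, and order dependencies).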

Three types of nodes: mention nodes (M), entity nodes (E), and sentence nodes (S).

Five types of edges: M-M edges, M-E edges, M-S edges, E-S edges, and S-S edges.
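
To make the node and edge types concrete, here is a minimal sketch of how such a graph could be assembled. The notes do not spell out the heuristic rules, so the rules in the comments (same-sentence mentions for M-M, full connectivity for S-S, and so on) are assumptions in the spirit of the edge-oriented graph paper, and `build_graph` and its input format are hypothetical names used only for illustration.

```python
from itertools import combinations


def build_graph(mentions, num_sentences):
    """Build typed, undirected edge sets for a document.

    mentions: list of (mention_id, entity_id, sentence_index) triples.
    All connection rules below are assumed, not taken from the paper.
    """
    edges = {"M-M": set(), "M-E": set(), "M-S": set(), "E-S": set(), "S-S": set()}
    for m_id, e_id, s_idx in mentions:
        edges["M-E"].add((("M", m_id), ("E", e_id)))   # mention refers to entity
        edges["M-S"].add((("M", m_id), ("S", s_idx)))  # mention occurs in sentence
        edges["E-S"].add((("E", e_id), ("S", s_idx)))  # entity is mentioned in sentence
    for (m1, _, s1), (m2, _, s2) in combinations(mentions, 2):
        if s1 == s2:                                   # mentions co-occur in a sentence
            edges["M-M"].add((("M", m1), ("M", m2)))
    for s1, s2 in combinations(range(num_sentences), 2):
        edges["S-S"].add((("S", s1), ("S", s2)))       # all sentences connected
    return edges


# Toy document: entity "a" is mentioned in sentences 0 and 2, entity "b" in sentence 0.
graph = build_graph([(0, "a", 0), (1, "b", 0), (2, "a", 2)], num_sentences=3)
```

Keeping one edge set per type is what lets a relational GCN apply a separate weight matrix to each edge type.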

Different from GCN, R-GCN considers various types of edges and can better model multi-relational graphs.

An L-layer stacked R-GCN:

This yields the entity global representations: e_{i}^{glo}
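
The R-GCN formula did not survive in these notes, so the following is only a sketch of the standard R-GCN propagation rule from the R-GCN literature: each node sums, over edge types, the mean of its neighbours' features transformed by a type-specific matrix, adds a self-connection term, and applies a nonlinearity. The ReLU activation, the mean normalisation, and all names (`rgcn_layer`, `W_rel`, `W_self`) are assumptions rather than the paper's exact formulation.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rgcn_layer(feats, neighbors, W_rel, W_self):
    """One R-GCN layer.

    feats:     {node: feature vector}
    neighbors: {edge_type: {node: [neighbour nodes]}}
    W_rel:     {edge_type: weight matrix}, one matrix per edge type
    W_self:    weight matrix for the self-connection
    """
    out = {}
    for i, h in feats.items():
        acc = matvec(W_self, h)                       # self-connection term
        for edge_type, nbrs_by_node in neighbors.items():
            nbrs = nbrs_by_node.get(i, [])
            for j in nbrs:
                msg = matvec(W_rel[edge_type], feats[j])
                # average the messages within each edge type
                acc = [a + m / len(nbrs) for a, m in zip(acc, msg)]
        out[i] = [max(0.0, a) for a in acc]           # ReLU (assumed)
    return out


def stack_layers(feats, neighbors, layers):
    """Apply L layers in sequence; layers is a list of (W_rel, W_self) pairs."""
    for W_rel, W_self in layers:
        feats = rgcn_layer(feats, neighbors, W_rel, W_self)
    return feats
```

Under this sketch, the output vector of an entity node after the last layer would play the role of e_{i}^{glo}.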

Local Layer (==key== innovation)

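Starting point: an entity should have a different local representation in each entity pair. (This is intuitive: the same entity, in different contexts or in different entity pairs, should reasonably be represented differently.)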

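Solution: multi-head attention
- Q is related to the entity global representations (e_b^{glo} when computing e_a^{loc}).
- K is related to the initial sentence node representations.
- V is related to the initial mention node representations.
- \mathcal{M}_a denotes the set of mentions of entity a;
- \mathcal{S}_a denotes the corresponding sentence nodes (the sentences in which each mention node in \mathcal{M}_a is located).

The local representation models the pair-specific semantics needed when predicting for different target entity pairs: multi-head attention selectively aggregates the representations of an entity's multiple mentions for the specific entity pair at hand.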

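Intuitively, if one sentence contains mentions m_a and m_b of entities a and b, then the mention node representations n_{m_{a}} and n_{m_b} contribute more to the final prediction of the relation between a and b, so they receive larger attention weights when generating e_a^{loc} and e_b^{loc}.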

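More generally, the higher the semantic similarity between the sentence node containing m_a and e_b^{glo} (marked with a red box in the original figure), the more closely that sentence is related to m_b, and the more n_{m_{a}} contributes to the generated e_a^{loc}.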
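
The weighting described above can be illustrated with a stripped-down attention step. This is a single-head, scaled dot-product simplification with no learned projections and no layer normalisation, so it is not the multi-head attention the paper uses; `local_representation` and its arguments are hypothetical names. The query stands in for e_b^{glo}, each key for the sentence node of one mention of a, and each value for that mention's node.

```python
import math


def local_representation(query, keys, values):
    """Single-head scaled dot-product attention (simplified sketch).

    query:  global representation of the other entity (stand-in for e_b^{glo})
    keys:   sentence node vectors, one per mention of entity a
    values: mention node vectors, aligned with keys
    Returns (local vector for entity a, attention weights).
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    local = [sum(w * v[t] for w, v in zip(weights, values)) for t in range(dim)]
    return local, weights


# The first sentence is more similar to the query, so its mention gets more weight.
e_a_loc, attn = local_representation([1.0, 0.0],
                                     keys=[[2.0, 0.0], [0.0, 2.0]],
                                     values=[[1.0, 0.0], [0.0, 1.0]])
```

Swapping the roles of a and b (query from e_a^{glo}, keys and values from b's sentences and mentions) would give the corresponding sketch for e_b^{loc}.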

Classifier Layer

e = [global rep., local rep., relative distance rep.]

\delta_{ab} denotes the relative distance between entity a and entity b, which is then mapped into a distance bin.
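
The notes do not say how the bins are defined. As one plausible illustration only, the sketch below maps a signed distance to signed power-of-two bins (1, 2-3, 4-7, 8-15, and so on); both the bin boundaries and the use of a signed distance are assumptions, and `distance_bin` is a hypothetical helper. The resulting bin index would then be looked up in an embedding table to obtain the relative distance representation.

```python
def distance_bin(delta):
    """Map a signed relative distance to a signed power-of-two bin index.

    Assumed scheme: |delta| in [2^(k-1), 2^k - 1] falls into bin k; 0 maps to bin 0.
    """
    if delta == 0:
        return 0
    sign = 1 if delta > 0 else -1
    # bit_length() is the number of bits needed, i.e. floor(log2(|delta|)) + 1
    return sign * abs(delta).bit_length()
```

Coarser bins for larger distances keep the embedding table small while still separating nearby pairs from distant ones.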

Concatenate to form the final representation:

The document's topic information hints at the likely relation types.

Thus we use self-attention to capture context relation representations:

o_i (o_j) is the relation representation of the i^{th} (j^{th}) entity pair.

==Since an entity pair may hold more than one relation, the multi-class classification problem is converted into multiple binary classification problems.==

Loss function:
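
The loss formula is missing from these notes. A common choice for multiple binary classification is a per-relation sigmoid with binary cross-entropy, sketched below on raw scores (logits) for one entity pair. The function names, the 0.5 threshold, and the numerically stable formulation are assumptions for illustration, not the paper's stated implementation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def multilabel_bce(logits, labels):
    """Sum of binary cross-entropy terms, one per relation type.

    Uses the stable identity: BCE(z, y) = max(z, 0) - z * y + log(1 + exp(-|z|)).
    """
    return sum(max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
               for z, y in zip(logits, labels))


def predict(logits, threshold=0.5):
    """Return the indices of all relation types whose probability exceeds the threshold."""
    return [r for r, z in enumerate(logits) if sigmoid(z) > threshold]
```

Because each relation type is scored independently, `predict` can return zero, one, or several relations for the same entity pair.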

Experiment

Datasets
- DocRED and CDR
Results
Findings

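Entity distance
- GLRE stands out most when the entity distance (defined as the shortest sentence distance between all mentions of two entities) is >= 3, for two reasons:
  - global heterogeneous graph: effectively models the semantic interactions among nodes of different types;
  - entity local representation: reduces the noisy context introduced by distant mentions.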
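
The entity distance definition quoted above is simple to compute. A minimal sketch, assuming each entity is given as the list of sentence indices in which it is mentioned (`entity_distance` is a hypothetical helper name):

```python
def entity_distance(sentences_a, sentences_b):
    """Shortest sentence distance between any mention of a and any mention of b.

    sentences_a, sentences_b: sentence indices of the mentions of each entity.
    """
    return min(abs(i - j) for i in sentences_a for j in sentences_b)
```

For example, an entity mentioned in sentences 0 and 4 and another mentioned in sentences 2 and 7 have entity distance 2, while two entities that share a sentence have distance 0.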
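Number of entity mentions

For a given entity, more mentions generally help, but the noisy context that comes with additional mentions cannot be ignored (which is why Ign F1 does not keep improving, or improves only marginally, as the number of mentions grows).

Ablation study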