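EMNLP (Empirical Methods in Natural Language Processing) is a top international conference in natural language processing. It is organized by SIGDAT, a special interest group of the Association for Computational Linguistics (ACL). EMNLP is held once a year. Last year it was held jointly with IJCNLP in Hong Kong, and this year it moved online because of the COVID-19 pandemic.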

EMNLP 2020 received 3,677 submissions, of which 3,359 were valid. It accepted 752 papers in total: 602 long papers and 150 short papers.

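EMNLP 2020's acceptance rate was 22.4% (752 of the 3,359 valid submissions), the lowest in five years. The acceptance rate was 24.6% for long papers and 16.6% for short papers.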


【1】Deconstructing word embedding algorithms


Authors: Kian Kenyon-Dean, Edward Newell, Jackie Chi Kit Cheung

Comments: EMNLP 2020, 6 pages. arXiv admin note: substantial text overlap with arXiv:1911.13280


Abstract: Word embeddings are reliable feature representations of words used to obtain high-quality results for various NLP applications. Uncontextualized word embeddings are used in many NLP tasks today, especially in resource-limited settings where high memory capacity and GPUs are not available. Given the historical success of word embeddings in NLP, we propose a retrospective on some of the most well-known word embedding algorithms. In this work, we deconstruct Word2vec, GloVe, and others into a common form, unveiling some of the common conditions that seem to be required for making performant word embeddings. We believe that the theoretical findings in this paper can provide a basis for more informed development of future models.
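The abstract does not spell out the "common form." The following is a hedged illustration only, an assumption rather than the paper's derivation. Many analyses relate Word2vec and GloVe to factorizing a matrix of word-context association statistics such as pointwise mutual information (PMI). The sketch builds a positive PMI (PPMI) matrix from a co-occurrence count matrix with NumPy and then factorizes it with a truncated SVD. `ppmi` and `embed` are hypothetical helper names.

```python
import numpy as np


def ppmi(counts: np.ndarray) -> np.ndarray:
    """Positive PMI for a (words x contexts) co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()                 # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)        # marginal P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)        # marginal P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[counts == 0] = 0.0                       # unseen pairs contribute nothing
    return np.maximum(pmi, 0.0)                  # keep only positive associations


def embed(counts: np.ndarray, dim: int) -> np.ndarray:
    """Low-rank word vectors from a truncated SVD of the PPMI matrix."""
    u, s, _ = np.linalg.svd(ppmi(counts))
    return u[:, :dim] * np.sqrt(s[:dim])         # split singular values symmetrically
```

The sketch does not model the weighting schemes and shifts that distinguish the real algorithms (for example, GloVe's weighting function or Word2vec's negative sampling).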

【2】 RethinkCWS: Is Chinese Word Segmentation a Solved Task?


Authors: Jinlan Fu, Pengfei Liu, Qi Zhang, Xuanjing Huang

Comments: Accepted by EMNLP 2020


Abstract: The performance of Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks, especially the successful use of large pre-trained models. In this paper, we take stock of what we have achieved and rethink what's left in the CWS task. Methodologically, we propose a fine-grained evaluation for existing CWS systems, which not only allows us to diagnose the strengths and weaknesses of existing models (under the in-dataset setting), but also enables us to quantify the discrepancy between different criteria and alleviate the negative transfer problem when doing multi-criteria learning. Strategically, despite not aiming to propose a novel model in this paper, our comprehensive experiments on eight models and seven datasets, as well as thorough analysis, could point to some promising directions for future research. We make all code publicly available and release an interface that can quickly evaluate and diagnose users' models: https://github.com/neulab/InterpretEval.
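A fine-grained CWS evaluation still starts from the usual word-level precision, recall and F1. The minimal sketch below assumes the common convention of mapping each segmentation to character-offset spans and counting exact span matches. It does not show the paper's attribute bucketing. `to_spans` and `seg_prf` are hypothetical names.

```python
def to_spans(words):
    """Map a segmented sentence (list of words) to a set of (start, end) char offsets."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def seg_prf(gold_words, pred_words):
    """Word-level precision, recall and F1 for one sentence."""
    if "".join(gold_words) != "".join(pred_words):
        raise ValueError("gold and predicted segmentations cover different text")
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)                   # a word counts only if both boundaries match
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Consider gold `["我", "爱", "北京", "天安门"]` and prediction `["我", "爱", "北京天安门"]`. Two of the three predicted words match, so P = 2/3, R = 1/2 and F1 = 4/7.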

【3】 Interpretable Multi-dataset Evaluation for Named Entity Recognition


Authors: Jinlan Fu, Pengfei Liu, Graham Neubig

Comments: Accepted by EMNLP 2020


Abstract: With the proliferation of models for natural language processing tasks, it is even harder to understand the differences between models and their relative merits. Simply looking at differences between holistic metrics such as accuracy, BLEU, or F1 does not tell us why or how particular methods perform differently and how diverse datasets influence the model design choices. In this paper, we present a general methodology for interpretable evaluation for the named entity recognition (NER) task. The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them, identifying the strengths and weaknesses of current systems. By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area: https://github.com/neulab/InterpretEval.
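One way to make a holistic F1 interpretable is to split entities into buckets by an attribute and score each bucket separately. The sketch below uses entity length as the example attribute and assumes entities are `(start, end, label)` tuples with an exclusive end. The attribute, the bucket edges and the function names are hypothetical illustrations, not the released tool's interface.

```python
from collections import defaultdict


def length_bucket(entity, edges=(1, 2, 3)):
    """Hypothetical attribute: entity length in tokens, bucketed as 1, 2, 3, >3."""
    start, end, _label = entity
    length = end - start
    for edge in edges:
        if length <= edge:
            return str(edge)
    return f">{edges[-1]}"


def bucketed_f1(gold, pred, bucket_fn=length_bucket):
    """F1 computed separately inside each attribute bucket."""
    gold_b, pred_b = defaultdict(set), defaultdict(set)
    for ent in gold:
        gold_b[bucket_fn(ent)].add(ent)
    for ent in pred:
        pred_b[bucket_fn(ent)].add(ent)
    scores = {}
    for b in sorted(set(gold_b) | set(pred_b)):
        g, p = gold_b[b], pred_b[b]
        correct = len(g & p)                     # exact span + label matches
        prec = correct / len(p) if p else 0.0
        rec = correct / len(g) if g else 0.0
        scores[b] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores
```

A model with a high overall F1 can still score poorly in one bucket, which is the kind of difference a single holistic metric hides.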

【4】 diagNNose: A Library for Neural Activation Analysis


Author: Jaap Jumelet

Comments: Accepted to the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, EMNLP 2020


Abstract: In this paper we introduce diagNNose, an open source library for analysing the activations of deep neural networks. diagNNose contains a wide array of interpretability techniques that provide fundamental insights into the inner workings of neural networks. We demonstrate the functionality of diagNNose with a case study on subject-verb agreement within language models. diagNNose is available at https://github.com/i-machine-think/diagnnose.

【5】 Context-aware Stand-alone Neural Spelling Correction


Authors: Xiangci Li, Hairong Liu, Liang Huang

Comments: 8 pages, 5 tables, 1 figure. Findings of the Association for Computational Linguistics: EMNLP 2020


Abstract: Existing natural language processing systems are vulnerable to noisy inputs resulting from misspellings. In contrast, humans can easily infer the corresponding correct words from their misspellings and surrounding context. Inspired by this, we address the stand-alone spelling correction problem, which only corrects the spelling of each token without additional token insertion or deletion, by utilizing both spelling information and global context representations. We present a simple yet powerful solution that jointly detects and corrects misspellings as a sequence labeling task by fine-tuning a pre-trained language model. Our solution outperforms the previous state-of-the-art result by 12.8% absolute F0.5 score.
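F0.5 is the F-beta score with beta = 0.5: F_beta = (1 + beta^2) P R / (beta^2 P + R). Because beta < 1, it weights precision more heavily than recall. That suits correction systems, where "fixing" a correct word is costly. The sketch below assumes the counts have already been defined. For example, tp might be misspellings corrected properly, fp words changed wrongly, and fn misspellings left unfixed; the paper's exact counting convention may differ.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from raw counts; beta < 1 favours precision, beta > 1 favours recall."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0
```

Take tp = 8, fp = 2 and fn = 8, which gives P = 0.8 and R = 0.5. F0.5 is then 5/7 ≈ 0.714, while F1 is only 8/13 ≈ 0.615.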

【6】 doc2dial: A Goal-Oriented Document-Grounded Dialogue Dataset


Authors: Song Feng, Hui Wan, Chulaka Gunasekara, Siva Sankalp Patel, Sachindra Joshi, Luis A. Lastras

Comments: EMNLP 2020


Abstract: We introduce doc2dial, a new dataset of goal-oriented dialogues that are grounded in the associated documents. Inspired by how the authors compose documents for guiding end users, we first construct dialogue flows based on the content elements that correspond to higher-level relations across text sections as well as lower-level relations between discourse units within a section. Then we present these dialogue flows to crowd contributors to create conversational utterances. The dataset includes over 4500 annotated conversations with an average of 14 turns that are grounded in over 450 documents from four domains. Compared to prior document-grounded dialogue datasets, this dataset covers a variety of dialogue scenes in information-seeking conversations. To evaluate the versatility of the dataset, we introduce multiple dialogue modeling tasks and present baseline approaches.

【7】 Learning from Task Descriptions


Authors: Orion Weller, Nicholas Lourie, Matt Gardner, Matthew E. Peters

Comments: EMNLP 2020


Abstract: Typically, machine learning systems solve new tasks by training on thousands of examples. In contrast, humans can solve new tasks by reading some instructions, with perhaps an example or two. To take a step toward closing this gap, we introduce a framework for developing NLP systems that solve new tasks after reading their descriptions, synthesizing prior work in this area. We instantiate this framework with a new English language dataset, ZEST, structured for task-oriented evaluation on unseen tasks. Formulating task descriptions as questions, we ensure each is general enough to apply to many possible inputs, thus comprehensively evaluating a model's ability to solve each task. Moreover, the dataset's structure tests specific types of systematic generalization. We find that the state-of-the-art T5 model achieves a score of 12% on ZEST, leaving a significant challenge for NLP researchers.

【8】 A Dataset for Tracking Entities in Open Domain Procedural Text


Authors: Niket Tandon, Keisuke Sakaguchi, Bhavana Dalvi Mishra, Dheeraj Rajagopal, Peter Clark, Michal Guerquin, Kyle Richardson, Eduard Hovy

Comments: To appear in EMNLP 2020


Abstract: We present the first dataset for tracking state changes in procedural text from arbitrary domains by using an unrestricted (open) vocabulary. For example, in a text describing fog removal using potatoes, a car window may transition between being foggy, sticky, opaque, and clear. Previous formulations of this task provide the text and entities involved, and ask how those entities change for just a small, pre-defined set of attributes (e.g., location), limiting their fidelity. Our solution is a new task formulation where, given just a procedural text as input, the task is to generate a set of state change tuples (entity, attribute, before-state, after-state) for each step, where the entity, attribute, and state values must be predicted from an open vocabulary. Using crowdsourcing, we create OpenPI, a high-quality (91.5% coverage as judged by humans and completely vetted) and large-scale dataset comprising 29,928 state changes over 4,050 sentences from 810 procedural real-world paragraphs from WikiHow.com. A current state-of-the-art generation model on this task achieves 16.1% F1 based on the BLEU metric, leaving enough room for novel model architectures.

【9】 ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments


Authors: Hyounghun Kim, Abhay Zala, Graham Burri, Hao Tan, Mohit Bansal

Comments: EMNLP Findings 2020 (18 pages; extended to Hindi)


Abstract: For embodied agents, navigation is an important ability but not an isolated goal. Agents are also expected to perform specific tasks after reaching the target location, such as picking up objects and assembling them into a particular arrangement. We combine Vision-and-Language Navigation, assembling of collected objects, and object referring expression comprehension to create a novel joint navigation-and-assembly task, named ArraMon. During this task, the agent (similar to a PokeMON GO player) is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment, but then also ARRAnge the collected objects part-by-part in an egocentric grid-layout environment. To support this task, we implement a 3D dynamic environment simulator and collect a dataset (in English; and also extended to Hindi) with human-written navigation and assembling instructions, and the corresponding ground truth trajectories. We also filter the collected instructions via a verification stage, leading to a total of 7.7K task instances (30.8K instructions and paths). We present results for several baseline models (integrated and biased) and metrics (nDTW, CTC, rPOD, and PTC), and the large model-human performance gap demonstrates that our task is challenging and presents a wide scope for future work. Our dataset, simulator, and code are publicly available at: https://arramonunc.github.io
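nDTW, one of the listed metrics, is commonly defined as nDTW(R, Q) = exp(-DTW(R, Q) / (|R| * d_th)). Here R is the reference path, Q is the agent's path, and d_th is a success-distance threshold. The sketch below assumes paths are lists of 2D points compared with Euclidean distance. The default d_th = 3.0 is an arbitrary placeholder, not ArraMon's setting.

```python
import math


def dtw(ref, query):
    """Classic dynamic time warping cost between two point sequences."""
    n, m = len(ref), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], query[j - 1])   # local Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # advance in ref only
                                 cost[i][j - 1],      # advance in query only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]


def ndtw(ref, query, d_th=3.0):
    """Normalized DTW in (0, 1]; 1.0 means the agent followed the reference exactly."""
    return math.exp(-dtw(ref, query) / (len(ref) * d_th))
```

Suppose the agent's path runs parallel to a 3-point reference at a constant offset of 1. The DTW cost is then 3, so nDTW = exp(-1/3) ≈ 0.72 with d_th = 3.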

【10】 DORB: Dynamically Optimizing Multiple Rewards with Bandits


Authors: Ramakanth Pasunuru, Han Guo, Mohit Bansal

Comments: EMNLP 2020 (15 pages)


Abstract: Policy gradients-based reinforcement learning has proven to be a promising approach for directly optimizing non-differentiable evaluation metrics for language generation tasks. However, optimizing for a specific metric reward leads to improvements in mostly that metric only, suggesting that the model is gaming the formulation of that metric in a particular way without often achieving real qualitative improvements. Hence, it is more beneficial to make the model optimize multiple diverse metric rewards jointly. While appealing, this is challenging because one needs to manually decide the importance and scaling weights of these metric rewards. Further, it is important to consider using a dynamic combination and curriculum of metric rewards that flexibly changes over time. Considering the above aspects, in our work, we automate the optimization of multiple metric rewards simultaneously via a multi-armed bandit approach (DORB), where at each round, the bandit chooses which metric reward to optimize next, based on expected arm gains. We use the Exp3 algorithm for bandits and formulate two approaches for bandit rewards: (1) Single Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit (HM-Bandit). We empirically show the effectiveness of our approaches via various automatic metrics and human evaluation on two important NLG tasks: question generation and data-to-text generation, including on an unseen-test transfer setup. Finally, we present interpretable analyses of the learned bandit curriculum over the optimized rewards.
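The abstract names Exp3 as the bandit algorithm. Below is a minimal sketch of standard Exp3, assuming each arm stands for one metric reward and that the bandit's reward signal is already scaled to [0, 1]. How DORB actually computes that signal (SM-Bandit vs. HM-Bandit) is not reproduced here, and the class is a hypothetical illustration rather than the authors' code.

```python
import math
import random


class Exp3:
    """Exp3 adversarial bandit over n_arms arms; rewards are assumed to lie in [0, 1]."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.k = n_arms
        self.gamma = gamma                          # exploration rate
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probs(self):
        """Mix the weight-proportional distribution with uniform exploration."""
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k for w in self.weights]

    def select(self):
        """Sample the arm (here: the metric reward) to optimize next."""
        return self.rng.choices(range(self.k), weights=self.probs())[0]

    def update(self, arm, reward):
        est = reward / self.probs()[arm]            # importance-weighted reward estimate
        self.weights[arm] *= math.exp(self.gamma * est / self.k)
        top = max(self.weights)                     # rescale to avoid overflow;
        self.weights = [w / top for w in self.weights]  # probabilities are unchanged
```

In a training loop, `select()` would pick which metric's policy-gradient reward to use for the next round. `update()` would then feed back a gain signal, for instance a hypothetical normalized improvement on a validation metric.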

【11】 IIRC: A Dataset of Incomplete Information Reading Comprehension Questions


Authors: James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, Pradeep Dasigi

Comments: EMNLP 2020


Abstract: Humans often have to read multiple documents to address their information needs. However, most existing reading comprehension (RC) tasks only focus on questions for which the contexts provide all the information required to answer them, thus not evaluating a system's performance at identifying a potential lack of sufficient information and locating sources for that information. To fill this gap, we present a dataset, IIRC, with more than 13K questions over paragraphs from English Wikipedia that provide only partial information to answer them, with the missing information occurring in one or more linked documents. The questions were written by crowd workers who did not have access to any of the linked documents, leading to questions that have little lexical overlap with the contexts where the answers appear. This process also produced many questions without answers, and many that require discrete reasoning, increasing the difficulty of the task. We follow recent modeling work on various reading comprehension datasets to construct a baseline model for this dataset, finding that it achieves 31.1% F1 on this task, while estimated human performance is 88.4%. The dataset, code for the baseline system, and a leaderboard can be found at https://allennlp.org/iirc.
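The F1 figures above are reading-comprehension answer scores. The sketch below shows the widely used SQuAD-style token-overlap F1, which normalizes case, punctuation and English articles and then compares bags of tokens. IIRC's official evaluation may differ, for example in how it scores numeric or unanswerable questions. The function names are hypothetical.

```python
import re
import string
from collections import Counter

PUNCT = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, gold):
    pred_toks, gold_toks = normalize(prediction).split(), normalize(gold).split()
    if not pred_toks or not gold_toks:
        # e.g. unanswerable questions: credit only an exact empty-vs-empty match
        return float(pred_toks == gold_toks)
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * p * r / (p + r)
```

For example, "the Eiffel Tower" against "Eiffel Tower in Paris" gives P = 1, R = 0.5 and F1 = 2/3.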

【12】 Structural and Functional Decomposition for Personality Image Captioning in a Communication Game


Authors: Thu Nguyen, Duy Phung, Minh Hoai, Thien Huu Nguyen

Affiliations: VinAI Research, Vietnam; University of Information Technology, VNU-HCM, Vietnam; Stony Brook University, Stony Brook, NY, USA; University of Oregon, Eugene, OR, USA

Comments: 10 pages, EMNLP-Findings 2020

Journal-ref: EMNLP-Findings 2020


Abstract: Personality image captioning (PIC) aims to describe an image with a natural language caption given a personality trait. In this work, we introduce a novel formulation for PIC based on a communication game between a speaker and a listener. The speaker attempts to generate natural language captions while the listener encourages the generated captions to contain discriminative information about the input images and personality traits. In this way, we expect that the generated captions can be improved to naturally represent the images and express the traits. In addition, we propose to adapt the language model GPT2 to perform caption generation for PIC. This enables the speaker and listener to benefit from the language encoding capacity of GPT2. Our experiments show that the proposed model achieves state-of-the-art performance for PIC.

【13】 Where Are You? Localization from Embodied Dialog


Authors: Meera Hahn, Jacob Krantz, Dhruv Batra, Devi Parikh, James M. Rehg, Stefan Lee, Peter Anderson

Affiliations: Georgia Institute of Technology, Oregon State University, Facebook AI Research (FAIR)

Journal-ref: EMNLP 2020


Abstract: We present Where Are You? (WAY), a dataset of 6k dialogs in which two humans, an Observer and a Locator, complete a cooperative localization task. The Observer is spawned at random in a 3D environment and can navigate from first-person views while answering questions from the Locator. The Locator must localize the Observer in a detailed top-down map by asking questions and giving instructions. Based on this dataset, we define three challenging tasks: Localization from Embodied Dialog or LED (localizing the Observer from dialog history), Embodied Visual Dialog (modeling the Observer), and Cooperative Localization (modeling both agents). In this paper, we focus on the LED task, providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices. Our best model achieves 32.7% success at identifying the Observer's location within 3m in unseen buildings, vs. 70.4% for human Locators.
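The headline number is a thresholded success rate: the fraction of episodes whose predicted location lies within 3 m of the true one. The sketch below assumes straight-line Euclidean distance between 2D map coordinates in metres and counts a distance of exactly 3 m as a success. The paper's evaluation may use a different distance (such as geodesic distance through the building) or boundary convention. `localization_accuracy` is a hypothetical name.

```python
import math


def localization_accuracy(preds, golds, threshold_m=3.0):
    """Fraction of predicted (x, y) locations within threshold_m of the true location."""
    if not preds or len(preds) != len(golds):
        raise ValueError("need equally many, non-zero predictions and gold locations")
    hits = sum(math.dist(p, g) <= threshold_m for p, g in zip(preds, golds))
    return hits / len(preds)
```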

【14】 Sequence-Level Mixed Sample Data Augmentation


Authors: Demi Guo, Yoon Kim, Alexander M. Rush

Affiliations: Harvard University, MIT-IBM Watson AI Lab, Cornell University

Comments: EMNLP 2020


Abstract: Despite their empirical success, neural networks still have difficulty capturing compositional aspects of natural language. This work proposes a simple data augmentation approach to encourage compositional behavior in neural models for sequence-to-sequence problems. Our approach, SeqMix, creates new synthetic examples by softly combining input/output sequences from the training set. We connect this approach to existing techniques such as SwitchOut and word dropout, and show that these techniques are all approximating variants of a single objective. SeqMix consistently yields approximately 1.0 BLEU improvement on five different translation datasets over strong Transformer baselines. On tasks that require strong compositional generalization such as SCAN and semantic parsing, SeqMix also offers further improvements.
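"Softly combining input/output sequences" can be pictured as mixup applied to token sequences. The sketch below makes three assumptions: it mixes one-hot token representations, it draws a single weight lam ~ Beta(alpha, alpha) shared by source and target, and it zero-pads the shorter sequence. This is an illustration in the spirit of the description, not the paper's exact SeqMix formulation. That formulation (which the paper links to SwitchOut and word dropout) may sample its mixing differently and define the training loss on the mixed target in its own way. `mix_pair` is a hypothetical name.

```python
import numpy as np


def one_hot(ids, vocab_size, length):
    """(length, vocab_size) one-hot rows; rows past len(ids) stay all-zero as padding."""
    ids = np.asarray(ids, dtype=int)
    out = np.zeros((length, vocab_size))
    out[np.arange(len(ids)), ids] = 1.0
    return out


def mix_pair(src_a, tgt_a, src_b, tgt_b, vocab_size, alpha=0.2, rng=None):
    """Mixup-style soft combination of two (source, target) token-id pairs.

    The same mixing weight lam ~ Beta(alpha, alpha) is applied to source and target.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    lam = float(rng.beta(alpha, alpha))
    s_len = max(len(src_a), len(src_b))
    t_len = max(len(tgt_a), len(tgt_b))
    src = lam * one_hot(src_a, vocab_size, s_len) + (1 - lam) * one_hot(src_b, vocab_size, s_len)
    tgt = lam * one_hot(tgt_a, vocab_size, t_len) + (1 - lam) * one_hot(tgt_b, vocab_size, t_len)
    return lam, src, tgt
```

In a real model the mixed one-hot rows would typically be multiplied by the embedding matrix, and the mixed target would serve as a soft label distribution.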

【15】 Relation Extraction with Contextualized Relation Embedding (CRE)


Authors: Xiaoyu Chen, Rohan Badlani

Affiliations: Dept. of Computer Science, Stanford University

Comments: EMNLP 2020 Workshop: Deep Learning Inside Out (DeeLIO)


Abstract: Relation extraction is the task of identifying relation instances between two entities given a corpus, whereas knowledge base modeling is the task of representing a knowledge base in terms of relations between entities. This paper proposes an architecture for the relation extraction task that integrates semantic information with knowledge base modeling in a novel manner. Existing approaches for relation extraction either do not utilize knowledge base modelling or use separately trained KB models for the RE task. We present a model architecture that internalizes KB modeling in relation extraction. This model applies a novel approach to encode sentences into contextualized relation embeddings, which can then be used together with parameterized entity embeddings to score relation instances. The proposed CRE model achieves state-of-the-art performance on datasets derived from The New York Times Annotated Corpus and FreeBase. The source code has been made available.
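The abstract says a sentence-derived relation embedding is combined with entity embeddings to score relation instances, but it does not give the scoring function. As a purely hypothetical stand-in, the sketch uses a TransE-style score, -||h + r - t||, a common KB-modeling choice. The relation vector plays the role of the contextualized relation embedding. CRE's actual score may be different.

```python
import numpy as np


def transe_score(head, rel, tail):
    """Hypothetical TransE-style score: higher (closer to 0) when head + rel is near tail."""
    return -float(np.linalg.norm(np.asarray(head) + np.asarray(rel) - np.asarray(tail)))


def best_tail(head, rel, candidates):
    """Return the candidate tail name whose embedding scores highest with (head, rel)."""
    return max(candidates, key=lambda name: transe_score(head, rel, candidates[name]))
```

With this kind of score, training pushes the relation embedding read off a sentence toward the vector offset between the two entities it mentions.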

【16】 Exploring Text Specific and Blackbox Fairness Algorithms in Multimodal Clinical NLP


Authors: John Chen, Ian Berlot-Atwell, Safwan Hossain, Xindi Wang, Frank Rudzicz

Affiliations: University of Toronto, Vector Institute, University of Western Ontario, St. Michael's Hospital

Comments: Best paper award at 3rd Clinical Natural Language Processing Workshop at EMNLP 2020

Journal-ref: Proceedings of the 3rd Clinical Natural Language Processing Workshop (2020), pages 301–312


Abstract: Clinical machine learning is increasingly multimodal, collected in both structured tabular formats and unstructured forms such as free text. We propose a novel task of exploring fairness on a multimodal clinical dataset, adopting equalized odds for the downstream medical prediction tasks. To this end, we investigate a modality-agnostic fairness algorithm (equalized odds post-processing) and compare it to a text-specific fairness algorithm: debiased clinical word embeddings. Although debiased word embeddings do not explicitly address equalized odds of protected groups, we show that a text-specific approach to fairness may simultaneously achieve a good balance of performance and classical notions of fairness. We hope that our paper inspires future contributions at the critical intersection of clinical NLP and fairness. The full source code is available here: https://github.com/johntiger1/multimodal_fairness
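Equalized odds asks that a binary classifier have the same true-positive rate and the same false-positive rate in every protected group. The sketch below measures how far a set of predictions is from that criterion by reporting the largest between-group gaps in TPR and FPR. It is only a diagnostic, not the post-processing algorithm the paper evaluates. It assumes labels and predictions in {0, 1}, and the function names are hypothetical.

```python
def group_rates(y_true, y_pred):
    """True-positive and false-positive rates for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def equalized_odds_gaps(y_true, y_pred, groups):
    """Largest between-group TPR and FPR differences (both 0 under exact equalized odds)."""
    rates = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        rates[g] = group_rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
    tprs = [r[0] for r in rates.values()]
    fprs = [r[1] for r in rates.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)
```

Post-processing methods for equalized odds adjust group-specific decision rules until these gaps shrink, typically at some cost in overall accuracy.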