diff --git a/data/xml/2020.acl.xml b/data/xml/2020.acl.xml index 906cecb6ee..a77eb185f7 100644 --- a/data/xml/2020.acl.xml +++ b/data/xml/2020.acl.xml @@ -200,6 +200,7 @@ Cross-modal language generation tasks such as image captioning are directly hurt in their ability to support non-English languages by the trend of data-hungry models combined with the lack of non-English annotations. We investigate potential solutions for combining existing language-generation annotations in English with translation capabilities in order to create solutions at web-scale in both domain and language coverage. We describe an approach called Pivot-Language Generation Stabilization (PLuGS), which leverages directly at training time both existing English annotations (gold data) as well as their machine-translated versions (silver data); at run-time, it generates first an English caption and then a corresponding target-language caption. We show that PLuGS models outperform other candidate solutions in evaluations performed over 5 different target languages, under a large-domain testset using images from the Open Images dataset. Furthermore, we find an interesting effect where the English captions generated by the PLuGS models are better than the captions generated by the original, monolingual English model. 2020.acl-main.16 10.18653/v1/2020.acl-main.16 + 2020.acl-main.16.Dataset.pdf <fixed-case>F</fixed-case>act-based <fixed-case>T</fixed-case>ext <fixed-case>E</fixed-case>diting @@ -666,6 +667,7 @@ Existing end-to-end dialog systems perform less effectively when data is scarce. To obtain an acceptable success in real-life online services with only a handful of training examples, both fast adaptability and reliable performance are highly desirable for dialog systems. In this paper, we propose the Meta-Dialog System (MDS), which combines the advantages of both meta-learning approaches and human-machine collaboration. We evaluate our methods on a new extended-bAbI dataset and a transformed MultiWOZ dataset for low-resource goal-oriented dialog learning. Experimental results show that MDS significantly outperforms non-meta-learning baselines and can achieve more than 90% per-turn accuracies with only 10 dialogs on the extended-bAbI dataset. 2020.acl-main.57 10.18653/v1/2020.acl-main.57 + 2020.acl-main.57.Dataset.zip Learning to Tag <fixed-case>OOV</fixed-case> Tokens by Integrating Contextual Representation and Background Knowledge @@ -796,6 +798,7 @@ 2020.acl-main.69 2020.acl-main.69.Source.zip 10.18653/v1/2020.acl-main.69 + 2020.acl-main.69.Dataset.pdf An Online Semantic-enhanced <fixed-case>D</fixed-case>irichlet Model for Short Text Stream Clustering @@ -973,6 +976,7 @@ In this paper, we introduce a novel methodology to efficiently construct a corpus for question answering over structured data. For this, we introduce an intermediate representation that is based on the logical query plan in a database, called Operation Trees (OT). This representation allows us to invert the annotation process without losing flexibility in the types of queries that we generate. Furthermore, it allows for fine-grained alignment of the tokens to the operations. Thus, we randomly generate OTs from a context free grammar and annotators just have to write the appropriate question and assign the tokens.
We compare our corpus OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus for evaluating natural language interfaces to databases, to Spider and LC-QuaD 2.0 and show that our methodology more than triples the annotation speed while maintaining the complexity of the queries. Finally, we train a state-of-the-art semantic parsing model on our data and show that our dataset is a challenging dataset and that the token alignment can be leveraged to significantly increase the performance. 2020.acl-main.84 10.18653/v1/2020.acl-main.84 + 2020.acl-main.84.Dataset.zip Contextualized Sparse Representations for Real-Time Open-Domain Question Answering @@ -1266,6 +1270,7 @@ The patterns in which the syntax of different languages converges and diverges are often used to inform work on cross-lingual transfer. Nevertheless, little empirical work has been done on quantifying the prevalence of different syntactic divergences across language pairs. We propose a framework for extracting divergence patterns for any language pair from a parallel corpus, building on Universal Dependencies. We show that our framework provides a detailed picture of cross-language divergences, generalizes previous approaches, and lends itself to full automation. We further present a novel dataset, a manually word-aligned subset of the Parallel UD corpus in five languages, and use it to perform a detailed corpus study. We demonstrate the usefulness of the resulting analysis by showing that it can help account for performance patterns of a cross-lingual parser. 2020.acl-main.109 10.18653/v1/2020.acl-main.109 + 2020.acl-main.109.Dataset.zip Generating Counter Narratives against Online Hate Speech: Data and Strategies @@ -1379,6 +1384,7 @@ We consider the distinction between intended and perceived sarcasm in the context of textual sarcasm detection. The former occurs when an utterance is sarcastic from the perspective of its author, while the latter occurs when the utterance is interpreted as sarcastic by the audience. We show the limitations of previous labelling methods in capturing intended sarcasm and introduce the iSarcasm dataset of tweets labeled for sarcasm directly by their authors. Examining the state-of-the-art sarcasm detection models on our dataset showed low performance compared to previously studied datasets, which indicates that these datasets might be biased or obvious and sarcasm could be a phenomenon under-studied computationally thus far. By providing the iSarcasm dataset, we aim to encourage future NLP research to develop methods for detecting sarcasm in text as intended by the authors of the text, not as labeled under assumptions that we demonstrate to be sub-optimal. 2020.acl-main.118 10.18653/v1/2020.acl-main.118 + 2020.acl-main.118.Dataset.zip <fixed-case>AMR</fixed-case> Parsing via Graph-Sequence Iterative Inference @@ -1512,6 +1518,7 @@ Non-task oriented dialogue systems have achieved great success in recent years due to largely accessible conversation data and the development of deep learning techniques. Given a context, current systems are able to yield a relevant and fluent response, but sometimes make logical mistakes because of weak reasoning capabilities. To facilitate the conversation reasoning research, we introduce MuTual, a novel dataset for Multi-Turn dialogue Reasoning, consisting of 8,860 manually annotated dialogues based on Chinese student English listening comprehension exams. 
Compared to previous benchmarks for non-task oriented dialogue systems, MuTual is much more challenging since it requires a model that is able to handle various reasoning problems. Empirical results show that state-of-the-art methods only reach 71%, which is far behind human performance of 94%, indicating that there is ample room for improving reasoning ability. 2020.acl-main.130 10.18653/v1/2020.acl-main.130 + 2020.acl-main.130.Dataset.zip You Impress Me: Dialogue Generation via Mutual Persona Perception @@ -1777,6 +1784,7 @@ The main goal of machine translation has been to convey the correct content. Stylistic considerations have been at best secondary. We show that as a consequence, the output of three commercial machine translation systems (Bing, DeepL, Google) makes demographically diverse samples from five languages “sound” older and more male than the original. Our findings suggest that translation models reflect demographic bias in the training data. This opens up interesting new research avenues in machine translation to take stylistic considerations into account. 2020.acl-main.154 10.18653/v1/2020.acl-main.154 + 2020.acl-main.154.Dataset.zip <fixed-case>MMPE</fixed-case>: <fixed-case>A</fixed-case> <fixed-case>M</fixed-case>ulti-<fixed-case>M</fixed-case>odal <fixed-case>I</fixed-case>nterface for <fixed-case>P</fixed-case>ost-<fixed-case>E</fixed-case>diting <fixed-case>M</fixed-case>achine <fixed-case>T</fixed-case>ranslation @@ -1836,6 +1844,7 @@ Can artificial neural networks learn to represent inflectional morphology and generalize to new words as human speakers do? Kirov and Cotterell (2018) argue that the answer is yes: modern Encoder-Decoder (ED) architectures learn human-like behavior when inflecting English verbs, such as extending the regular past tense form /-(e)d/ to novel words. However, their work does not address the criticism raised by Marcus et al. (1995): that neural models may learn to extend not the regular, but the most frequent class — and thus fail on tasks like German number inflection, where infrequent suffixes like /-s/ can still be productively generalized. To investigate this question, we first collect a new dataset from German speakers (production and ratings of plural forms for novel nouns) that is designed to avoid sources of information unavailable to the ED model. The speaker data show high variability, and two suffixes evince ‘regular’ behavior, appearing more often with phonologically atypical inputs. Encoder-decoder models do generalize the most frequently produced plural class, but do not show human-like variability or ‘regular’ extension of these other plural markers. We conclude that modern neural models may still struggle with minority-class generalization. 2020.acl-main.159 10.18653/v1/2020.acl-main.159 + 2020.acl-main.159.Dataset.zip Overestimation of Syntactic Representation in Neural Language Models @@ -2130,6 +2139,7 @@ The Natural Language Understanding (NLU) component in task oriented dialog systems processes a user’s request and converts it into structured information that can be consumed by downstream components such as the Dialog State Tracker (DST). This information is typically represented as a semantic frame that captures the intent and slot-labels provided by the user. We first show that such a shallow representation is insufficient for complex dialog scenarios, because it does not capture the recursive nature inherent in many domains.
We propose a recursive, hierarchical frame-based representation and show how to learn it from data. We formulate the frame generation task as a template-based tree decoding task, where the decoder recursively generates a template and then fills slot values into the template. We extend local tree-based loss functions with terms that provide global supervision and show how to optimize them end-to-end. We achieve a small improvement on the widely used ATIS dataset and a much larger improvement on a more complex dataset we describe here. 2020.acl-main.186 10.18653/v1/2020.acl-main.186 + 2020.acl-main.186.Dataset.zip Speak to your Parser: Interactive Text-to-<fixed-case>SQL</fixed-case> with Natural Language Feedback @@ -2392,6 +2402,7 @@ 2020.acl-main.210 2020.acl-main.210.Software.zip 10.18653/v1/2020.acl-main.210 + 2020.acl-main.210.Dataset.zip Interactive Machine Comprehension with Information Seeking Agents @@ -2485,6 +2496,7 @@ Effective dialogue involves grounding, the process of establishing mutual knowledge that is essential for communication between people. Modern dialogue systems are not explicitly trained to build common ground, and therefore overlook this important aspect of communication. Improvisational theater (improv) intrinsically contains a high proportion of dialogue focused on building common ground, and makes use of the yes-and principle, a strong grounding speech act, to establish coherence and an actionable objective reality. We collect a corpus of more than 26,000 yes-and turns, transcribing them from improv dialogues and extracting them from larger, but more sparsely populated movie script dialogue corpora, via a bootstrapped classifier. We fine-tune chit-chat dialogue systems with our corpus to encourage more grounded, relevant conversation and confirm these findings with human evaluations. 2020.acl-main.218 10.18653/v1/2020.acl-main.218 + 2020.acl-main.218.Dataset.zip Image-Chat: Engaging Grounded Conversations @@ -2706,6 +2718,7 @@ We study the potential for interaction in natural language classification. We add a limited form of interaction for intent classification, where users provide an initial query using natural language, and the system asks for additional information using binary or multi-choice questions. At each turn, our system decides between asking the most informative question or making the final classification prediction. The simplicity of the model allows for bootstrapping of the system without interaction data, instead relying on simple crowd-sourcing tasks. We evaluate our approach on two domains, showing the benefit of interaction and the advantage of learning to balance between asking additional questions and making the final prediction. 2020.acl-main.237 10.18653/v1/2020.acl-main.237 + 2020.acl-main.237.Dataset.pdf Knowledge Graph Embedding Compression @@ -2886,6 +2899,7 @@ Back-translation is a widely used data augmentation technique which leverages target monolingual data. However, its effectiveness has been challenged since automatic metrics such as BLEU only show significant improvements for test examples where the source itself is a translation, or translationese. This is believed to be due to translationese inputs better matching the back-translated training data. In this work, we show that this conjecture is not empirically supported and that back-translation improves translation quality of both naturally occurring text as well as translationese according to professional human translators.
We provide empirical evidence to support the view that back-translation is preferred by humans because it produces more fluent outputs. BLEU cannot capture human preferences because references are translationese when source sentences are natural text. We recommend complementing BLEU with a language model score to measure fluency. 2020.acl-main.253 10.18653/v1/2020.acl-main.253 + 2020.acl-main.253.Dataset.zip Simultaneous Translation Policies: From Fixed to Adaptive @@ -3022,6 +3036,7 @@ Recent developments in Neural Relation Extraction (NRE) have made significant strides towards Automated Knowledge Base Construction. While much attention has been dedicated towards improvements in accuracy, there have been no attempts in the literature to evaluate social biases exhibited in NRE systems. In this paper, we create WikiGenderBias, a distantly supervised dataset composed of over 45,000 sentences including a 10% human annotated test set for the purpose of analyzing gender bias in relation extraction systems. We find that when extracting spouse-of and hypernym (i.e., occupation) relations, an NRE system performs differently when the gender of the target entity is different. However, such disparity does not appear when extracting relations such as birthDate or birthPlace. We also analyze how existing bias mitigation techniques, such as name anonymization, word embedding debiasing, and data augmentation affect the NRE system in terms of maintaining the test performance and reducing biases. Unfortunately, because NRE models rely heavily on surface level cues, we find that existing bias mitigation approaches have a negative effect on NRE. Our analysis lays groundwork for future quantifying and mitigating bias in NRE. 2020.acl-main.265 10.18653/v1/2020.acl-main.265 + 2020.acl-main.265.Dataset.zip A Probabilistic Generative Model for Typographical Analysis of Early Modern Printing @@ -3288,6 +3303,7 @@ 2020.acl-main.287 2020.acl-main.287.Software.zip 10.18653/v1/2020.acl-main.287 + 2020.acl-main.287.Dataset.pdf <fixed-case>ECPE</fixed-case>-2<fixed-case>D</fixed-case>: Emotion-Cause Pair Extraction based on Joint Two-Dimensional Representation, Interaction and Prediction @@ -3399,6 +3415,7 @@ 2020.acl-main.297 2020.acl-main.297.Software.zip 10.18653/v1/2020.acl-main.297 + 2020.acl-main.297.Dataset.pdf Towards Better Non-Tree Argument Mining: Proposition-Level Biaffine Parsing with Task-Specific Parameterization @@ -3638,6 +3655,7 @@ 2020.acl-main.319 2020.acl-main.319.Software.zip 10.18653/v1/2020.acl-main.319 + 2020.acl-main.319.Dataset.pdf A Retrieve-and-Rewrite Initialization Method for Unsupervised Machine Translation @@ -4605,6 +4623,7 @@ 2020.acl-main.405 2020.acl-main.405.Software.pdf 10.18653/v1/2020.acl-main.405 + 2020.acl-main.405.Dataset.zip “Who said it, and Why?” Provenance for Natural Language Claims @@ -4843,6 +4862,7 @@ In order to simplify a sentence, human editors perform multiple rewriting transformations: they split it into several shorter sentences, paraphrase words (i.e. replacing complex words or phrases by simpler synonyms), reorder components, and/or delete information deemed unnecessary. Despite this varied range of possible text alterations, current models for automatic sentence simplification are evaluated using datasets that are focused on a single transformation, such as lexical paraphrasing or splitting. This makes it impossible to understand the ability of simplification models in more realistic settings.
To alleviate this limitation, this paper introduces ASSET, a new dataset for assessing sentence simplification in English. ASSET is a crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations. Through quantitative and qualitative experiments, we show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task. Furthermore, we motivate the need for developing better methods for automatic evaluation using ASSET, since we show that current popular metrics may not be suitable when multiple simplification transformations are performed. 2020.acl-main.424 10.18653/v1/2020.acl-main.424 + 2020.acl-main.424.Dataset.zip Fatality Killed the Cat or: <fixed-case>B</fixed-case>abel<fixed-case>P</fixed-case>ic, a Multimodal Dataset for Non-Concrete Concepts @@ -4876,6 +4896,7 @@ We propose a semantic parsing dataset focused on instruction-driven communication with an agent in the game Minecraft. The dataset consists of 7K human utterances and their corresponding parses. Given proper world state, the parses can be interpreted and executed in game. We report the performance of baseline models, and analyze their successes and failures. 2020.acl-main.427 10.18653/v1/2020.acl-main.427 + 2020.acl-main.427.Dataset.zip Don’t Say That! Making Inconsistent Dialogue Unlikely with Unlikelihood Training @@ -5054,6 +5075,7 @@ There is an increasing interest in studying natural language and computer code together, as large corpora of programming texts become readily available on the Internet. For example, StackOverflow currently has over 15 million programming related questions written by 8.5 million users. Meanwhile, there is still a lack of fundamental NLP techniques for identifying code tokens or software-related named entities that appear within natural language sentences. In this paper, we introduce a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We trained in-domain BERT representations (BERTOverflow) on 152 million sentences from StackOverflow, which lead to an absolute increase of +10 F1 score over off-the-shelf BERT. We also present the SoftNER model which achieves an overall 79.10 F-1 score for code and named entity recognition on StackOverflow data. Our SoftNER model incorporates a context-independent code token classifier with corpus-level features to improve the BERT-based tagging model. Our code and data are available at: https://github.com/jeniyat/StackOverflowNER/ 2020.acl-main.443 10.18653/v1/2020.acl-main.443 + 2020.acl-main.443.Dataset.zip Dialogue-Based Relation Extraction @@ -5263,6 +5285,7 @@ Human speakers have an extensive toolkit of ways to express themselves. In this paper, we engage with an idea largely absent from discussions of meaning in natural language understanding—namely, that the way something is expressed reflects different ways of conceptualizing or construing the information being conveyed. We first define this phenomenon more precisely, drawing on considerable prior work in theoretical cognitive semantics and psycholinguistics. We then survey some dimensions of construed meaning and show how insights from construal could inform theoretical and practical work in NLP. 
2020.acl-main.462 10.18653/v1/2020.acl-main.462 + 2020.acl-main.462.Software.zip Climbing towards <fixed-case>NLU</fixed-case>: <fixed-case>On</fixed-case> Meaning, Form, and Understanding in the Age of Data @@ -5368,6 +5391,7 @@ Not all documents are equally important. Language processing is increasingly finding use as a supplement for questionnaires to assess psychological attributes of consenting individuals, but most approaches neglect to consider whether all documents of an individual are equally informative. In this paper, we present a novel model that uses message-level attention to learn the relative weight of users’ social media posts for assessing their five factor personality traits. We demonstrate that models with message-level attention outperform those with word-level attention, and ultimately yield state-of-the-art accuracies for all five traits by using both word and message attention in combination with past approaches (an average increase in Pearson r of 2.5%). In addition, examination of the high-signal posts identified by our model provides insight into the relationship between language and personality, helping to inform future work. 2020.acl-main.472 10.18653/v1/2020.acl-main.472 + 2020.acl-main.472.Dataset.pdf Measuring Forecasting Skill from Text @@ -5435,6 +5459,7 @@ Understanding discourse structures of news articles is vital to effectively contextualize the occurrence of a news event. To enable computational modeling of news structures, we apply an existing theory of functional discourse structure for news articles that revolves around the main event and create a human-annotated corpus of 802 documents spanning over four domains and three media sources. Next, we propose several document-level neural-network models to automatically construct news content structures. Finally, we demonstrate that incorporating system predicted news structures yields new state-of-the-art performance for event coreference resolution. The news documents we annotated are openly available and the annotations are publicly released for future research. 2020.acl-main.478 10.18653/v1/2020.acl-main.478 + 2020.acl-main.478.Dataset.zip Harnessing the linguistic signal to predict scalar inferences @@ -5529,6 +5554,7 @@ Warning: this paper contains content that may be offensive or upsetting. Language has the power to reinforce stereotypes and project social biases onto others. At the core of the challenge is that it is rarely what is stated explicitly, but rather the implied meanings, that frame people’s judgments about others. For example, given a statement that “we shouldn’t lower our standards to hire more women,” most listeners will infer the implicature intended by the speaker - that “women (candidates) are less qualified.” Most semantic formalisms, to date, do not capture such pragmatic implications in which people express social biases and power differentials in language. We introduce Social Bias Frames, a new conceptual formalism that aims to model the pragmatic frames in which people project social biases and stereotypes onto others. In addition, we introduce the Social Bias Inference Corpus to support large-scale modelling and evaluation with 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups. We then establish baseline approaches that learn to recover Social Bias Frames from unstructured text. 
We find that while state-of-the-art neural models are effective at high-level categorization of whether a given statement projects unwanted social bias (80% F1), they are not effective at spelling out more detailed explanations in terms of Social Bias Frames. Our study motivates future work that combines structured pragmatic inference with commonsense reasoning on social implications. 2020.acl-main.486 10.18653/v1/2020.acl-main.486 + 2020.acl-main.486.Dataset.tgz Social Biases in <fixed-case>NLP</fixed-case> Models as Barriers for Persons with Disabilities @@ -5642,6 +5668,7 @@ Selecting input features of top relevance has become a popular method for building self-explaining models. In this work, we extend this selective rationalization approach to text matching, where the goal is to jointly select and align text pieces, such as tokens or sentences, as a justification for the downstream prediction. Our approach employs optimal transport (OT) to find a minimal cost alignment between the inputs. However, directly applying OT often produces dense and therefore uninterpretable alignments. To overcome this limitation, we introduce novel constrained variants of the OT problem that result in highly sparse alignments with controllable sparsity. Our model is end-to-end differentiable using the Sinkhorn algorithm for OT and can be trained without any alignment annotations. We evaluate our model on the StackExchange, MultiNews, e-SNLI, and MultiRC datasets. Our model achieves very sparse rationale selections with high fidelity while preserving prediction accuracy compared to strong attention baseline models. 2020.acl-main.496 10.18653/v1/2020.acl-main.496 + 2020.acl-main.496.Dataset.pdf Benefits of Intermediate Annotations in Reading Comprehension @@ -6195,6 +6222,7 @@ Given a sentence and its relevant answer, how to ask good questions is a challenging task, which has many real applications. Inspired by human’s paraphrasing capability to ask questions of the same meaning but with diverse expressions, we propose to incorporate paraphrase knowledge into question generation(QG) to generate human-like questions. Specifically, we present a two-hand hybrid model leveraging a self-built paraphrase resource, which is automatically conducted by a simple back-translation method. On the one hand, we conduct multi-task learning with sentence-level paraphrase generation (PG) as an auxiliary task to supplement paraphrase knowledge to the task-share encoder. On the other hand, we adopt a new loss function for diversity training to introduce more question patterns to QG. Extensive experimental results show that our proposed model obtains obvious performance gain over several strong baselines, and further human evaluation validates that our model can ask questions of high quality by leveraging paraphrase knowledge. 
2020.acl-main.545 10.18653/v1/2020.acl-main.545 + 2020.acl-main.545.Dataset.zip <fixed-case>N</fixed-case>eu<fixed-case>I</fixed-case>nfer: Knowledge Inference on <fixed-case>N</fixed-case>-ary Facts @@ -6473,6 +6501,7 @@ 2020.acl-main.569 2020.acl-main.569.Software.zip 10.18653/v1/2020.acl-main.569 + 2020.acl-main.569.Dataset.pdf Amalgamation of protein sequence, structure and textual information for improving protein-protein interaction identification @@ -6641,6 +6670,7 @@ 2020.acl-main.583 2020.acl-main.583.Software.zip 10.18653/v1/2020.acl-main.583 + 2020.acl-main.583.Dataset.tsv <fixed-case>K</fixed-case>nowledge Supports Visual Language Grounding: <fixed-case>A</fixed-case> Case Study on Colour Terms @@ -6695,6 +6725,7 @@ Aspect-based sentiment classification is a popular task aimed at identifying the corresponding emotion of a specific aspect. One sentence may contain various sentiments for different aspects. Many sophisticated methods such as attention mechanism and Convolutional Neural Networks (CNN) have been widely employed for handling this challenge. Recently, semantic dependency tree implemented by Graph Convolutional Networks (GCN) is introduced to describe the inner connection between aspects and the associated emotion words. But the improvement is limited due to the noise and instability of dependency trees. To this end, we propose a dependency graph enhanced dual-transformer network (named DGEDT) by jointly considering the flat representations learnt from Transformer and graph-based representations learnt from the corresponding dependency graph in an iterative interaction manner. Specifically, a dual-transformer structure is devised in DGEDT to support mutual reinforcement between the flat representation learning and graph-based representation learning. The idea is to allow the dependency graph to guide the representation learning of the transformer encoder and vice versa. The results on five datasets demonstrate that the proposed DGEDT outperforms all state-of-the-art alternatives with a large margin. 2020.acl-main.588 10.18653/v1/2020.acl-main.588 + 2020.acl-main.588.Dataset.zip Differentiable Window for Dynamic Local Attention @@ -6808,6 +6839,7 @@ 2020.acl-main.597 2020.acl-main.597.Software.zip 10.18653/v1/2020.acl-main.597 + 2020.acl-main.597.Dataset.pdf Unsupervised Morphological Paradigm Completion @@ -7008,6 +7040,7 @@ Showing items that do not match search query intent degrades customer experience in e-commerce. These mismatches result from counterfactual biases of the ranking algorithms toward noisy behavioral signals such as clicks and purchases in the search logs. Mitigating the problem requires a large labeled dataset, which is expensive and time-consuming to obtain. In this paper, we develop a deep, end-to-end model that learns to effectively classify mismatches and to generate hard mismatched examples to improve the classifier. We train the model end-to-end by introducing a latent variable into the cross-entropy loss that alternates between using the real and generated samples. This not only makes the classifier more robust but also boosts the overall ranking performance. Our model achieves a relative gain compared to baselines by over 26% in F-score, and over 17% in Area Under PR curve. On live search traffic, our model gains significant improvement in multiple countries. 
2020.acl-main.614 10.18653/v1/2020.acl-main.614 + 2020.acl-main.614.Dataset.pdf Generalized Entropy Regularization or: There’s Nothing Special about Label Smoothing @@ -7630,6 +7663,7 @@ Extracting information from full documents is an important problem in many domains, but most previous work focuses on identifying relationships within a sentence or a paragraph. It is challenging to create a large-scale information extraction (IE) dataset at the document level since it requires an understanding of the whole document to annotate entities and their document-level relationships that usually span beyond sentences or even sections. In this paper, we introduce SciREX, a document level IE dataset that encompasses multiple IE tasks, including salient entity identification and document level N-ary relation identification from scientific articles. We annotate our dataset by integrating automatic and human annotations, leveraging existing scientific knowledge resources. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE. Analyzing the model performance shows a significant gap between human performance and current baselines, inviting the community to use our dataset as a challenge to develop document-level IE models. Our data and code are publicly available at https://github.com/allenai/SciREX . 2020.acl-main.670 10.18653/v1/2020.acl-main.670 + 2020.acl-main.670.Dataset.tgz Contrastive Self-Supervised Learning for Commonsense Reasoning @@ -7727,6 +7761,7 @@ Large-scale pretrained language models are the major driving force behind recent improvements in performance on the Winograd Schema Challenge, a widely employed test of commonsense reasoning ability. We show, however, with a new diagnostic dataset, that these models are sensitive to linguistic perturbations of the Winograd examples that minimally affect human understanding. Our results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than associative ones. 2020.acl-main.679 10.18653/v1/2020.acl-main.679 + 2020.acl-main.679.Dataset.zip Temporally-Informed Analysis of Named Entity Recognition @@ -7940,6 +7975,7 @@ 2020.acl-main.699 2020.acl-main.699.Software.zip 10.18653/v1/2020.acl-main.699 + 2020.acl-main.699.Dataset.zip Returning the <fixed-case>N</fixed-case> to <fixed-case>NLP</fixed-case>: <fixed-case>T</fixed-case>owards Contextually Personalized Classification Models @@ -7961,6 +7997,7 @@ Many tasks aim to measure machine reading comprehension (MRC), often focusing on question types presumed to be difficult. Rarely, however, do task designers start by considering what systems should in fact comprehend. In this paper we make two key contributions. First, we argue that existing approaches do not adequately define comprehension; they are too unsystematic about what content is tested. Second, we present a detailed definition of comprehension—a “Template of Understanding”—for a widely useful class of texts, namely short narratives. We then conduct an experiment that strongly suggests existing systems are not up to the task of narrative understanding as we define it.
2020.acl-main.701 10.18653/v1/2020.acl-main.701 + 2020.acl-main.701.Dataset.tgz Gender Gap in Natural Language Processing Research: Disparities in Authorship and Citations @@ -8059,6 +8096,7 @@ 2020.acl-main.709 2020.acl-main.709.Software.zip 10.18653/v1/2020.acl-main.709 + 2020.acl-main.709.Dataset.zip One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases @@ -8084,6 +8122,7 @@ We propose an unsupervised approach for sarcasm generation based on a non-sarcastic input sentence. Our method employs a retrieve-and-edit framework to instantiate two major characteristics of sarcasm: reversal of valence and semantic incongruity with the context, which could include shared commonsense or world knowledge between the speaker and the listener. While prior works on sarcasm generation predominantly focus on context incongruity, we show that combining valence reversal and semantic incongruity based on the commonsense knowledge generates sarcasm of higher quality. Human evaluation shows that our system generates sarcasm better than humans 34% of the time, and better than a reinforced hybrid baseline 90% of the time. 2020.acl-main.711 10.18653/v1/2020.acl-main.711 + 2020.acl-main.711.Dataset.pdf Structural Information Preserving for Graph-to-Text Generation @@ -8183,6 +8222,7 @@ 2020.acl-main.720 2020.acl-main.720.Software.zip 10.18653/v1/2020.acl-main.720 + 2020.acl-main.720.Dataset.pdf <fixed-case>Z</fixed-case>ero<fixed-case>S</fixed-case>hot<fixed-case>C</fixed-case>eres: Zero-Shot Relation Extraction from Semi-Structured Webpages @@ -8641,6 +8681,7 @@ The increased focus on misinformation has spurred development of data and systems for detecting the veracity of a claim as well as retrieving authoritative evidence. The Fact Extraction and VERification (FEVER) dataset provides such a resource for evaluating end-to-end fact-checking, requiring retrieval of evidence from Wikipedia to validate a veracity prediction. We show that current systems for FEVER are vulnerable to three categories of realistic challenges for fact-checking – multiple propositions, temporal reasoning, and ambiguity and lexical variation – and introduce a resource with these types of claims. Then we present a system designed to be resilient to these “attacks” using multiple pointer networks for document selection and jointly modeling a sequence of evidence sentences and veracity relation predictions. We find that in handling these attacks we obtain state-of-the-art results on FEVER, largely due to improved evidence retrieval. 2020.acl-main.761 10.18653/v1/2020.acl-main.761 + 2020.acl-main.761.Dataset.zip Let Me Choose: From Verbal Context to Font Selection @@ -8723,6 +8764,7 @@ Natural language inference (NLI) is an increasingly important task for natural language understanding, which requires one to infer whether a sentence entails another. However, the ability of NLI models to make pragmatic inferences remains understudied. We create an IMPlicature and PRESupposition diagnostic dataset (IMPPRES), consisting of 32K semi-automatically generated sentence pairs illustrating well-studied pragmatic inference types. We use IMPPRES to evaluate whether BERT, InferSent, and BOW NLI models trained on MultiNLI (Williams et al., 2018) learn to make pragmatic inferences. Although MultiNLI appears to contain very few pairs illustrating these inference types, we find that BERT learns to draw pragmatic inferences. It reliably treats scalar implicatures triggered by “some” as entailments.
For some presupposition triggers like “only”, BERT reliably recognizes the presupposition as an entailment, even when the trigger is embedded under an entailment canceling operator like negation. BOW and InferSent show weaker evidence of pragmatic reasoning. We conclude that NLI training encourages models to learn some, but not all, pragmatic inferences. 2020.acl-main.768 10.18653/v1/2020.acl-main.768 + 2020.acl-main.768.Dataset.zip End-to-End Bias Mitigation by Modelling Biases in Corpora @@ -9139,6 +9181,7 @@ 2020.acl-demos.23 2020.acl-demos.23.Software.zip 10.18653/v1/2020.acl-demos.23 + 2020.acl-demos.23.Dataset.pdf <fixed-case>P</fixed-case>hoton: A Robust Cross-Domain Text-to-<fixed-case>SQL</fixed-case> System @@ -9265,6 +9308,7 @@ 2020.acl-demos.33 2020.acl-demos.33.Software.zip 10.18653/v1/2020.acl-demos.33 + 2020.acl-demos.33.Dataset.pdf <fixed-case>ESP</fixed-case>net-<fixed-case>ST</fixed-case>: All-in-One Speech Translation Toolkit @@ -9592,6 +9636,7 @@ We present a simple and effective dependency parser for Telugu, a morphologically rich, free word order language. We propose to replace the rich linguistic feature templates used in the past approaches with a minimal feature function using contextual vector representations. We train a BERT model on the Telugu Wikipedia data and use vector representations from this model to train the parser. Each sentence token is associated with a vector representing the token in the context of that sentence and the feature vectors are constructed by concatenating two token representations from the stack and one from the buffer. We put the feature representations through a feedforward network and train with a greedy transition based approach. The resulting parser has a very simple architecture with minimal feature engineering and achieves state-of-the-art results for Telugu. 2020.acl-srw.19 10.18653/v1/2020.acl-srw.19 + 2020.acl-srw.19.Dataset.zip Pointwise Paraphrase Appraisal is Potentially Problematic @@ -9612,6 +9657,8 @@ A large percentage of the world’s population speaks a language of the Indian subcontinent, comprising languages from both Indo-Aryan (e.g. Hindi, Punjabi, Gujarati, etc.) and Dravidian (e.g. Tamil, Telugu, Malayalam, etc.) families. A universal characteristic of Indian languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high-quality parallel data, can make developing machine translation (MT) systems for these languages difficult. Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially in large training data scenarios. Since the condition of large parallel corpora is not met for Indian-English language pairs, we present our efforts towards building efficient NMT systems between Indian languages (specifically Indo-Aryan languages) and English via efficiently exploiting parallel data from the related languages. We propose a technique called Unified Transliteration and Subword Segmentation to leverage language similarity while exploiting parallel data from related language pairs. We also propose a Multilingual Transfer Learning technique to leverage parallel data from multiple related languages to assist translation for low resource language pair of interest. Our experiments demonstrate an overall average improvement of 5 BLEU points over the standard Transformer-based NMT baselines. 
2020.acl-srw.22 10.18653/v1/2020.acl-srw.22 + 2020.acl-srw.22.Dataset.zip + 2020.acl-srw.22.Software.zip Exploring Interpretability in Event Extraction: Multitask Learning of a Neural Event Classifier and an Explanation Decoder @@ -9779,6 +9826,7 @@ Sequence-to-sequence (S2S) pre-training using large monolingual data is known to improve performance for various S2S NLP tasks. However, large monolingual corpora might not always be available for the languages of interest (LOI). Thus, we propose to exploit monolingual corpora of other languages to complement the scarcity of monolingual corpora for the LOI. We utilize script mapping (Chinese to Japanese) to increase the similarity (number of cognates) between the monolingual corpora of helping languages and LOI. An empirical case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora, respectively, for S2S pre-training. Using only Chinese and French monolingual corpora, we were able to improve Japanese-English translation quality by up to 8.5 BLEU in low-resource scenarios. 2020.acl-srw.37 10.18653/v1/2020.acl-srw.37 + 2020.acl-srw.37.Software.zip Checkpoint Reranking: An Approach to Select Better Hypothesis for Neural Machine Translation Systems diff --git a/data/xml/2020.bea.xml b/data/xml/2020.bea.xml index 1785498786..78ae38c57b 100644 --- a/data/xml/2020.bea.xml +++ b/data/xml/2020.bea.xml @@ -87,6 +87,7 @@ In this paper we employ a novel approach to advancing our understanding of the development of writing in English and German children across school grades using classification tasks. The data used come from two recently compiled corpora: The English data come from the GiC corpus (983 school children in second-, sixth-, ninth- and eleventh-grade) and the German data are from the FD-LEX corpus (930 school children in fifth- and ninth-grade). The key to this paper is the combined use of what we refer to as ‘complexity contours’, i.e. series of measurements that capture the progression of linguistic complexity within a text, and Recurrent Neural Network (RNN) classifiers that adequately capture the sequential information in those contours. Our experiments demonstrate that RNN classifiers trained on complexity contours achieve higher classification accuracy than one trained on text-average complexity scores. In a second step, we determine the relative importance of the features from four distinct categories through a Sensitivity-Based Pruning approach. 2020.bea-1.6 10.18653/v1/2020.bea-1.6 + 2020.bea-1.6.Dataset.pdf Annotation and Classification of Evidence and Reasoning Revisions in Argumentative Writing @@ -205,6 +206,7 @@ Complex Word Identification (CWI) is a task for the identification of words that are challenging for second-language learners to read. Even though the use of neural classifiers is now common in CWI, the interpretation of their parameters remains difficult. This paper analyzes neural CWI classifiers and shows that some of their parameters can be interpreted as vocabulary size. We present a novel formalization of vocabulary size measurement methods that are practiced in the applied linguistics field as a kind of neural classifier. We also contribute to building a novel dataset for validating vocabulary testing and readability via crowdsourcing.
2020.bea-1.17 10.18653/v1/2020.bea-1.17 + 2020.bea-1.17.Dataset.zip Automated Scoring of Clinical Expressive Language Evaluation Tasks diff --git a/data/xml/2020.bionlp.xml b/data/xml/2020.bionlp.xml index ff54883fef..1e721613aa 100644 --- a/data/xml/2020.bionlp.xml +++ b/data/xml/2020.bionlp.xml @@ -130,6 +130,7 @@ Text classification tasks which aim at harvesting and/or organizing information from electronic health records are pivotal to support clinical and translational research. However, these present specific challenges compared to other classification tasks, notably due to the particular nature of the medical lexicon and language used in clinical records. Recent advances in embedding methods have shown promising results for several clinical tasks, yet there is no exhaustive comparison of such approaches with other commonly used word representations and classification models. In this work, we analyse the impact of various word representations, text pre-processing and classification algorithms on the performance of four different text classification tasks. The results show that traditional approaches, when tailored to the specific language and structure of the text inherent to the classification task, can achieve or exceed the performance of more recent ones based on contextual embeddings such as BERT. 2020.bionlp-1.9 10.18653/v1/2020.bionlp-1.9 + 2020.bionlp-1.9.Dataset.pdf Noise Pollution in Hospital Readmission Prediction: Long Document Classification with Reinforcement Learning diff --git a/data/xml/2020.figlang.xml b/data/xml/2020.figlang.xml index 4d9bdde77d..cb08582c35 100644 --- a/data/xml/2020.figlang.xml +++ b/data/xml/2020.figlang.xml @@ -261,6 +261,7 @@ 2020.figlang-1.23 2020.figlang-1.23.Software.zip 10.18653/v1/2020.figlang-1.23 + 2020.figlang-1.23.Dataset.pdf <fixed-case>O</fixed-case>xymorons: a preliminary corpus investigation diff --git a/data/xml/2020.iwpt.xml b/data/xml/2020.iwpt.xml index eba7ff494e..bff81ba16e 100644 --- a/data/xml/2020.iwpt.xml +++ b/data/xml/2020.iwpt.xml @@ -96,6 +96,7 @@ Semiring parsing is an elegant framework for describing parsers by using semiring weighted logic programs. In this paper we present a generalization of this concept: latent-variable semiring parsing. With our framework, any semiring weighted logic program can be latentified by transforming weights from scalar values of a semiring to rank-n arrays, or tensors, of semiring values, allowing the modelling of latent-variable models within the semiring parsing framework. Semiring is too strong a notion when dealing with tensors, and we have to resort to a weaker structure: a partial semiring. We prove that this generalization preserves all the desired properties of the original semiring framework while strictly increasing its expressiveness. 2020.iwpt-1.8 10.18653/v1/2020.iwpt-1.8 + 2020.iwpt-1.8.Dataset.pdf Advances in Using Grammars with Latent Annotations for Discontinuous Parsing diff --git a/data/xml/2020.ngt.xml b/data/xml/2020.ngt.xml index a6e21e4f45..cc9fe09443 100644 --- a/data/xml/2020.ngt.xml +++ b/data/xml/2020.ngt.xml @@ -35,6 +35,7 @@ We describe the findings of the Fourth Workshop on Neural Generation and Translation, held in concert with the annual conference of the Association for Computational Linguistics (ACL 2020). First, we summarize the research trends of papers presented in the proceedings.
Second, we describe the results of the three shared tasks: 1) efficient neural machine translation (NMT) where participants were tasked with creating NMT systems that are both accurate and efficient, and 2) document-level generation and translation (DGT) where participants were tasked with developing systems that generate summaries from structured data, potentially with assistance from text in another language, and 3) the STAPLE task: creation of as many translations as possible of a given input text. This last shared task was organised by Duolingo. 2020.ngt-1.1 10.18653/v1/2020.ngt-1.1 + 2020.ngt-1.1.Dataset.txt Learning to Generate Multiple Style Transfer Outputs for an Input Sentence @@ -296,6 +297,7 @@ We participated in all tracks of the Workshop on Neural Generation and Translation 2020 Efficiency Shared Task: single-core CPU, multi-core CPU, and GPU. At the model level, we use teacher-student training with a variety of student sizes, tie embeddings and sometimes layers, use the Simpler Simple Recurrent Unit, and introduce head pruning. On GPUs, we used 16-bit floating-point tensor cores. On CPUs, we customized 8-bit quantization and multiple processes with affinity for the multi-core setting. To reduce model size, we experimented with 4-bit log quantization but use floats at runtime. In the shared task, most of our submissions were Pareto optimal with respect to the trade-off between time and quality. 2020.ngt-1.26 10.18653/v1/2020.ngt-1.26 + 2020.ngt-1.26.Dataset.txt Improving Document-Level Neural Machine Translation with Domain Adaptation diff --git a/data/xml/2020.nlp4convai.xml b/data/xml/2020.nlp4convai.xml index 82fa66fbf9..ec3227ecb7 100644 --- a/data/xml/2020.nlp4convai.xml +++ b/data/xml/2020.nlp4convai.xml @@ -74,6 +74,7 @@ Building conversational systems in new domains and with added functionality requires resource-efficient models that work under low-data regimes (i.e., in few-shot setups). Motivated by these requirements, we introduce intent detection methods backed by pretrained dual sentence encoders such as USE and ConveRT. We demonstrate the usefulness and wide applicability of the proposed intent detectors, showing that: 1) they outperform intent detectors based on fine-tuning the full BERT-Large model or using BERT as a fixed black-box encoder on three diverse intent detection data sets; 2) the gains are especially pronounced in few-shot setups (i.e., with only 10 or 30 annotated examples per intent); 3) our intent detectors can be trained in a matter of minutes on a single CPU; and 4) they are stable across different hyperparameter settings. In hope of facilitating and democratizing research focused on intention detection, we release our code, as well as a new challenging single-domain intent detection dataset comprising 13,083 annotated examples over 77 intents. 2020.nlp4convai-1.5 10.18653/v1/2020.nlp4convai-1.5 + 2020.nlp4convai-1.5.Dataset.zip Accelerating Natural Language Understanding in Task-Oriented Dialog @@ -104,6 +105,7 @@ Speech-based virtual assistants, such as Amazon Alexa, Google assistant, and Apple Siri, typically convert users’ audio signals to text data through automatic speech recognition (ASR) and feed the text to downstream dialog models for natural language understanding and response generation. The ASR output is error-prone; however, the downstream dialog models are often trained on error-free text data, making them sensitive to ASR errors during inference time.
To bridge the gap and make dialog models more robust to ASR errors, we leverage an ASR error simulator to inject noise into the error-free text data, and subsequently train the dialog models with the augmented data. Compared to other approaches for handling ASR errors, such as using ASR lattice or end-to-end methods, our data augmentation approach does not require any modification to the ASR or downstream dialog models; our approach also does not introduce any additional latency during inference time. We perform extensive experiments on benchmark data and show that our approach improves the performance of downstream dialog models in the presence of ASR errors, and it is particularly effective in the low-resource situations where there are constraints on model size or the training data is scarce. 2020.nlp4convai-1.8 10.18653/v1/2020.nlp4convai-1.8 + 2020.nlp4convai-1.8.Dataset.zip Automating Template Creation for Ranking-Based Dialogue Models @@ -117,6 +119,7 @@ 2020.nlp4convai-1.9 2020.nlp4convai-1.9.Software.txt 10.18653/v1/2020.nlp4convai-1.9 + 2020.nlp4convai-1.9.Software.zip From Machine Reading Comprehension to Dialogue State Tracking: Bridging the Gap diff --git a/data/xml/2020.nuse.xml b/data/xml/2020.nuse.xml index ff7ee62eb3..d55dd30aad 100644 --- a/data/xml/2020.nuse.xml +++ b/data/xml/2020.nuse.xml @@ -88,6 +88,7 @@ 2020.nuse-1.6 2020.nuse-1.6.Software.zip 10.18653/v1/2020.nuse-1.6 + 2020.nuse-1.6.Dataset.pdf Script Induction as Association Rule Mining @@ -106,6 +107,7 @@ In this paper we introduce the problem of extracting events from dialogue. Previous work on event extraction focused on newswire, however we are interested in extracting events from spoken dialogue. To ground this study, we annotated dialogue transcripts from fourteen episodes of the podcast This American Life. This corpus contains 1,038 utterances, made up of 16,962 tokens, of which 3,664 represent events. The agreement for this corpus has a Cohen’s Kappa of 0.83. We have open-sourced this corpus for the NLP community. With this corpus in hand, we trained support vector machines (SVM) to correctly classify these phenomena with 0.68 F1, when using episode-fold cross-validation. This is nearly 100% higher F1 than the baseline classifier. The SVM models achieved performance of over 0.75 F1 on some testing folds. We report the results for SVM classifiers trained with four different types of features (verb classes, part of speech tags, named entities, and semantic role labels), and different machine learning protocols (under-sampling and trigram context). This work is grounded in narratology and computational models of narrative. It is useful for extracting events, plot, and story content from spoken dialogue. 2020.nuse-1.8 10.18653/v1/2020.nuse-1.8 + 2020.nuse-1.8.Dataset.zip Annotating and quantifying narrative time disruptions in modernist and hypertext fiction diff --git a/data/xml/2020.socialnlp.xml b/data/xml/2020.socialnlp.xml index d000301f17..04fbd37197 100644 --- a/data/xml/2020.socialnlp.xml +++ b/data/xml/2020.socialnlp.xml @@ -66,6 +66,7 @@ We investigate whether pre-trained bidirectional transformers with sentiment and emotion information improve stance detection in long discussions of contemporary issues. As a part of this work, we create a novel stance detection dataset covering 419 different controversial issues and their related pros and cons collected by procon.org in nonpartisan format. 
Experimental results show that a shallow recurrent neural network with sentiment or emotion information can reach competitive results compared to fine-tuned BERT with 20x fewer parameters. We also use a simple approach that explains which input phrases contribute to stance detection. 2020.socialnlp-1.5 10.18653/v1/2020.socialnlp-1.5 + 2020.socialnlp-1.5.Dataset.zip Challenges in Emotion Style Transfer: An Exploration with a Lexical Substitution Pipeline