- Dataset
- Machine Translation
- Machine Translation (Non-Autoregressive)
- Machine Translation (Low-Resource)
- Model Compression
- Attention
- Transformers
- Training Tips for Transformers
- Explanation
- Rich Answer Type
- Optimizer
- Text Attribute Transfer
- Layer Analysis
- Pre-Finetuning
- January 2021: An Efficient Transformer Decoder with Compressed Sub-layers
- December 2020: Train Once, and Decode As You Like
- November 2020: Language Models not just for Pre-training: Fast Online Neural Noisy Channel Modeling
- October 2020: Nearest Neighbor Machine Translation
- October 2020: Inference Strategies for Machine Translation with Conditional Masking
- October 2020: Multi-task Learning for Multilingual Neural Machine Translation
- September 2020: Energy-Based Reranking: Improving Neural Machine Translation Using Energy-Based Models
- September 2020: Softmax Tempering for Training Neural Machine Translation Models
- September 2020: CSP: Code-Switching Pre-training for Neural Machine Translation
- June 2020: Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation
- November 2020: Context-Aware Cross-Attention for Non-Autoregressive Translation
- April 2020: Non-Autoregressive Machine Translation with Latent Alignments
- October 2020: Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation
- October 2020: Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation
- September 2020: Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages
- October 2020: Adversarial Self-Supervised Data-Free Distillation for Text Classification
- October 2020: Optimizing Transformers with Approximate Computing for Faster, Smaller and more Accurate NLP Models
- September 2020: Contrastive Distillation on Intermediate Representations for Language Model Compression
- September 2020: Weight Distillation: Transferring the Knowledge in Neural Network Parameters
- June 2020: SqueezeBERT: What can computer vision teach NLP about efficient neural networks?
- February 2020: BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
- February 2020: Compressing Large-Scale Transformer-Based Models: A Case Study on BERT
- October 2020: Long Document Ranking with Query-Directed Sparse Transformer
- October 2020: SMYRF: Efficient Attention using Asymmetric Clustering
- October 2020: Improving Attention Mechanism with Query-Value Interaction
- October 2020: Guiding Attention for Self-Supervised Learning with Transformers
- September 2020: An Attention Free Transformer
- September 2020: Sparsifying Transformer Models with Differentiable Representation Pooling
- September 2020: Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference
- June 2020: Limits to Depth Efficiencies of Self-Attention
- May 2020: Hard-Coded Gaussian Attention for Neural Machine Translation
- November 2019: Location Attention for Extrapolation to Longer Sequences
- November 2020: Long Range Arena: A Benchmark for Efficient Transformers
- November 2020: Colorization Transformer
- October 2020: N-ODE Transformer: A Depth-Adaptive Variant of the Transformer Using Neural Ordinary Differential Equations
- August 2020: DeLighT: Very Deep and Light-weight Transformer
- April 2020: Fast and Accurate Deep Bidirectional Language Representations for Unsupervised Learning
- May 2019: Unified Language Model Pre-training for Natural Language Understanding and Generation
- November 2020: CharBERT: Character-aware Pre-trained Language Model
- October 2020: Long Document Ranking with Query-Directed Sparse Transformer
- June 2020: Progressive Generation of Long Text
- November 2020: Detecting Word Sense Disambiguation Biases in Machine Translation for Model-Agnostic Adversarial Attacks
- October 2020: On Losses for Modern Language Models
- October 2020: Cross-Thought for Sentence Encoder Pre-training
- October 2020: VECO: Variable Encoder-decoder Pre-training for Cross-lingual Understanding and Generation
- September 2019: You Only Train Once: Loss-Conditional Training of Deep Networks
- October 2020: Explaining and Improving Model Behavior with k Nearest Neighbor Representations
- April 2020: Attention Module is Not Only a Weight: Analyzing Transformers with Vector Norms
- September 2019: Learning to Deceive with Attention-Based Explanations
- September 2020: No Answer is Better Than Wrong Answer: A Reflection Model for Document Level Machine Reading Comprehension
- September 2020: Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization
- November 2020: Deep Learning for Text Attribute Transfer: A Survey
- October 2020: Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth