A curated list of pretrained sentence (and word) embedding models
- About This Repo
- General Framework
- Word Embeddings
- OOV Handling
- Contextualized Word Embeddings
- Pooling Methods
- Encoders
- Evaluation
- Misc
- Vector Mapping
- Articles
- Code Less
- well, there are some awesome lists for word embeddings and sentence embeddings, but all of them are outdated and, more importantly, incomplete
- this repo will also be incomplete, but I'll try my best to find and include all the papers with pretrained models
- this is not a typical awesome list because it has tables, but I guess that's OK and much better than just a huge list
- if you find any mistakes, another paper, or anything else, please send a pull request and help me keep this list up to date
- to be honest, I'm not 100% sure how to represent this data, and if you think there is a better way (for example, by changing the table headers) please send a pull request and let's discuss it
- enjoy!
- Almost all sentence embeddings work like this:
  - Given some sort of word embeddings and an optional encoder (for example an LSTM), they obtain contextualized word embeddings.
  - Then they define some sort of pooling (it can be as simple as last-pooling).
  - Based on that, they either use the representation directly for a supervised task (like InferSent) or use it to generate a target sequence (like skip-thought).
- So, in general, there are many sentence embeddings you have never heard of: you can simply do mean-pooling over any word embedding and you have a sentence embedding (a minimal sketch follows this list)!
- Note: don't worry about the language of the code; you can almost always (except for the subword models) just load the pretrained embedding table in the framework of your choice and ignore the training code
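As a concrete example of the framework above, the simplest member of this family is plain mean-pooling over a pretrained embedding table. A minimal sketch (the toy table, the 300-dimensional size, and the whitespace tokenization are placeholders, not tied to any particular model):

```python
import numpy as np

def sentence_embedding(tokens, table, dim=300):
    """Mean-pool the vectors of the tokens that appear in the table."""
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:                  # every token was OOV
        return np.zeros(dim)
    return np.mean(vectors, axis=0)  # mean-pooling -> sentence vector

# toy usage with a random embedding table
table = {"nice": np.random.rand(300), "movie": np.random.rand(300)}
vec = sentence_embedding("a nice movie".split(), table)
```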
- Drop OOV words!
- One OOV vector (unk vector); a minimal sketch of these two simple strategies follows this list
- ALaCarte: A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors
- Mimick: Mimicking Word Embeddings using Subword RNNs
- Note: all the unofficial models can load the official pretrained models
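To make the first two strategies concrete, here is a hedged sketch of a lookup that either drops OOV tokens or maps them all to a single unk vector (the table and the unk initialization are placeholders; ALaCarte and Mimick instead induce a vector for the unseen word from its contexts or its characters):

```python
import numpy as np

def lookup(tokens, table, unk_vector=None):
    """Return one vector per token; OOV tokens are dropped or replaced by a shared unk vector."""
    out = []
    for t in tokens:
        if t in table:
            out.append(table[t])
        elif unk_vector is not None:  # "one OOV vector" strategy
            out.append(unk_vector)
        # else: "drop OOV words" strategy -> skip the token
    return out

table = {"nice": np.random.rand(300)}
vectors = lookup("a nice movie".split(), table, unk_vector=np.zeros(300))
```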
- {Last, Mean, Max}-Pooling
- Special Token Pooling (like BERT and OpenAI's Transformer)
- SIF: A Simple but Tough-to-Beat Baseline for Sentence Embeddings (a minimal reimplementation sketch follows this list)
- TF-IDF: Unsupervised Sentence Representations as Word Information Series: Revisiting TF-IDF
- P-norm: Concatenated Power Mean Word Embeddings as Universal Cross-Lingual Sentence Representations
- DisC: A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs
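SIF in particular is easy to reimplement: frequency-weighted averaging followed by common-component removal. A minimal sketch, assuming you already have an embedding table and a dict of unigram probabilities (both are placeholder inputs here, and the reference implementation differs in details):

```python
import numpy as np

def sif_embeddings(sentences, table, unigram_prob, a=1e-3, dim=300):
    """sentences: list of token lists; unigram_prob: word -> corpus probability."""
    X = []
    for tokens in sentences:
        vecs = [table[t] * (a / (a + unigram_prob.get(t, 0.0)))
                for t in tokens if t in table]
        X.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    X = np.array(X)
    # remove the projection onto the first principal component of the sentence matrix
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)
```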
- decaNLP: The Natural Language Decathlon: Multitask Learning as Question Answering
- SentEval: SentEval: An Evaluation Toolkit for Universal Sentence Representations (a usage sketch follows this list)
- GLUE: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
- Exploring Semantic Properties of Sentence Embeddings
- Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks
- Word Embeddings Benchmarks: How to evaluate word embeddings? On importance of data efficiency and simple supervised tasks
- MLDoc: A Corpus for Multilingual Document Classification in Eight Languages
- LexNET: Olive Oil Is Made of Olives, Baby Oil Is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model
- wordvectors.org: Community Evaluation and Exchange of Word Vectors at wordvectors.org
- jiant
- Evaluation of sentence embeddings in downstream and linguistic probing tasks
- QVEC: Evaluation of Word Vector Representations by Subspace Alignment
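Most of these toolkits follow the same pattern: you hand them a function that embeds a batch of sentences and they run the downstream tasks for you. A hedged sketch using SentEval (following the usage shown in its repo; `embed` is a placeholder encoder and the data path is an assumption):

```python
import numpy as np
import senteval

def embed(sentence):
    # placeholder encoder: swap in the model you actually want to evaluate
    return np.random.rand(300)

def prepare(params, samples):
    pass  # e.g. build a vocabulary from `samples` if your encoder needs one

def batcher(params, batch):
    # batch is a list of tokenized sentences; return one vector per sentence
    return np.vstack([embed(" ".join(tokens)) for tokens in batch])

params = {"task_path": "SentEval/data", "usepytorch": False, "kfold": 5}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(["STS16", "MRPC", "SST2"])
```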
- Cross-lingual Word Vectors Projection Using CCA: Improving Vector Space Word Representations Using Multilingual Correlation
- vecmap: A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
- MUSE: Unsupervised Machine Translation Using Monolingual Corpora Only
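All of these learn a (mostly linear) map from one embedding space into another. As a rough illustration, here is a sketch of the closed-form orthogonal Procrustes step that vecmap and MUSE build on, assuming you already have two aligned matrices from a seed dictionary (the real tools add normalization, iterative self-learning and/or adversarial initialization, and the CCA approach uses a different projection):

```python
import numpy as np

def procrustes(X, Y):
    """X, Y: (n, d) matrices whose i-th rows are translations of each other.
    Returns the orthogonal W minimizing ||X @ W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# placeholder seed dictionary; map the whole source space into the target space
X = np.random.rand(1000, 300)  # source-language seed vectors
Y = np.random.rand(1000, 300)  # target-language seed vectors
mapped = X @ procrustes(X, Y)
```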
- Comparing Sentence Similarity Methods
- The Current Best of Universal Word Embeddings and Sentence Embeddings
- On sentence representations, pt. 1: what can you fit into a single #$!%@*&% blog post?
- Deep-learning-free Text and Sentence Embedding, Part 1
- Deep-learning-free Text and Sentence Embedding, Part 2
- An Overview of Sentence Embedding Methods
- Word embeddings in 2017: Trends and future directions
- A Walkthrough of InferSent – Supervised Learning of Sentence Embeddings
- A survey of cross-lingual word embedding models
- Introducing state of the art text classification with universal language models
- the papers here are just papers; they don't have any released code or pretrained models
- are you sure? I have read the paper, googled the title, googled the title + github, and searched for the authors one by one and checked their pages, so yeah, I'm 60% sure that they don't have anything! :))
- I did this two months ago (Oct 2018), and they might have released their code since then, so if you find any of it, let me know.
- Towards Language Agnostic Universal Representations
- Is Wasserstein All You Need?
- Discourse-Based Objectives for Fast Unsupervised Sentence Representation Learning
- Unsupervised Sentence Embedding Using Document Structure-based Context
- CSE: Conceptual Sentence Embeddings based on Attention Model
- Unsupervised Document Embedding With CNNs
- Learning Generic Sentence Representations Using Convolutional Neural Networks
- Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model
- Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling
- Zero-training Sentence Embedding via Orthogonal Basis
- Improving Sentence Representations with Multi-view Frameworks
- Unsupervised Learning of Sentence Representations Using Sequence Consistency
- Fake Sentence Detection as a Training Task for Sentence Encoding
- Poincaré GloVe: Hyperbolic Word Embeddings
- A Non-linear Theory for Sentence Embedding
- No Training Required: Exploring Random Encoders for Sentence Classification
- Variational Autoencoders for Text Modeling without Weakening the Decoder
- Improving Composition of Sentence Embeddings through the Lens of Statistical Relational Learning