InferSent is a sentence embeddings method that provides semantic sentence representations. It is trained on natural language inference data and generalizes well to many different tasks.
We provide our pre-trained sentence encoder for reproducing the results from our paper. See also SentEval for automatic evaluation of the quality of sentence embeddings.
This code is written in Python. The dependencies are:
To get GloVe, SNLI and MultiNLI [2GB, 90MB, 216MB], run (in dataset/):
./get_data.bash
This will download GloVe and preprocess SNLI/MultiNLI data/senteval_data.
See encoder/play.ipynb for an example.
0) Download our model trained on AllNLI (SNLI and MultiNLI) [147MB]:
curl -Lo encoder/infersent.allnli.pickle https://s3.amazonaws.com/senteval/infersent/infersent.allnli.pickle
1) Load our pre-trained model (in encoder/):
import torch
infersent = torch.load('infersent.allnli.pickle')
Note: to load it, you need the file "models.py" (in encoder/) that provides the definition of the model.
2) Set GloVe path for the model:
infersent.set_glove_path(glove_path)
where glove_path is the path to 'glove.840B.300d.txt', which contains the GloVe vectors with which our model was trained. Note that using GloVe vectors gives coverage of more than 2 million English words.
3) Build the vocabulary of word vectors (i.e., keep only those needed):
infersent.build_vocab(sentences, tokenize=True)
where sentences is your list of n sentences. You can update your vocabulary using infersent.update_vocab(sentences), or directly load the K most common words with infersent.build_vocab_k_words(K=100000). If tokenize is True (the default), sentences will be tokenized using NLTK. Run nltk.download('punkt') once to download the NLTK tokenizer.
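To illustrate what "keep only those needed" means, here is a minimal sketch that filters a GloVe-style text file down to the words that occur in your sentences. It is not the code used by build_vocab: the function name load_needed_vectors is hypothetical, whitespace splitting stands in for NLTK tokenization, and a tiny in-memory file stands in for 'glove.840B.300d.txt'.

```python
import io


def load_needed_vectors(glove_file, sentences):
    """Keep only the word vectors for words that occur in `sentences`."""
    # Whitespace tokenization is a simplification (assumption for this sketch).
    needed = {word for sentence in sentences for word in sentence.split()}
    vectors = {}
    for line in glove_file:
        # GloVe text format: a word followed by its space-separated components.
        word, _, rest = line.rstrip("\n").partition(" ")
        if word in needed:
            vectors[word] = [float(x) for x in rest.split()]
    return vectors


# Tiny in-memory stand-in for the GloVe file (3 dimensions instead of 300).
fake_glove = io.StringIO(
    "man 0.1 0.2 0.3\n"
    "plays 0.4 0.5 0.6\n"
    "piano 0.7 0.8 0.9\n"
)
sentences = ["A man plays an instrument ."]
word_vec = load_needed_vectors(fake_glove, sentences)
print(sorted(word_vec))  # words that are both needed and present in the file
```

Words that appear in the sentences but not in the vector file are simply absent from the result, which is why a large pre-trained vocabulary helps coverage.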
4) Encode your sentences (list of n sentences):
infersent.encode(sentences, tokenize=True)
This will output a NumPy array with n vectors of dimension 4096 (the dimension of the sentence embeddings). Speed is around 1000 sentences per second with batch size 128 on a single GPU.
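Steps 2 to 4 can be chained in a small helper. The sketch below uses only the method names listed in the steps above (set_glove_path, build_vocab, encode); the helper embed_sentences and the class _StubEncoder are hypothetical names introduced for illustration, and the stub exists only so the example runs without downloading the model.

```python
def embed_sentences(model, sentences, glove_path, tokenize=True):
    """Run steps 2-4 on an already loaded model and return its embeddings."""
    model.set_glove_path(glove_path)                   # step 2
    model.build_vocab(sentences, tokenize=tokenize)    # step 3
    return model.encode(sentences, tokenize=tokenize)  # step 4


class _StubEncoder:
    """Hypothetical stand-in for the real model; it only records the calls."""

    def __init__(self):
        self.calls = []

    def set_glove_path(self, path):
        self.calls.append("set_glove_path")

    def build_vocab(self, sentences, tokenize=True):
        self.calls.append("build_vocab")

    def encode(self, sentences, tokenize=True):
        self.calls.append("encode")
        # The real model returns an array of shape (n, 4096);
        # the stub returns zeros in plain lists of the same shape.
        return [[0.0] * 4096 for _ in sentences]


stub = _StubEncoder()
embeddings = embed_sentences(
    stub, ["A man plays an instrument."], "glove.840B.300d.txt"
)
print(len(embeddings), len(embeddings[0]))
```

With the real encoder, the object returned by torch.load('infersent.allnli.pickle') in step 1 would take the place of the stub.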
5) Visualize the importance that our model attributes to each word:
Our representations were trained to focus on semantic information such that a classifier can easily tell the difference between contradictory, neutral or entailed sentences. We provide a function to visualize the importance of each word in the encoding of a sentence:
infersent.visualize('A man plays an instrument.', tokenize=True)
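This README does not describe how visualize computes word importance. One plausible approach for an encoder that max-pools over word-level hidden states (an assumption made for this sketch, not a statement about the function above) is to count, for each word, the share of embedding dimensions in which that word's hidden state is the maximum. The function name max_pool_importance and the numbers below are made up for illustration.

```python
def max_pool_importance(hidden_states):
    """Return, per word, the fraction of dimensions where that word holds the max."""
    n_words = len(hidden_states)
    n_dims = len(hidden_states[0])
    counts = [0] * n_words
    for d in range(n_dims):
        column = [hidden_states[w][d] for w in range(n_words)]
        counts[column.index(max(column))] += 1  # ties go to the earliest word
    return [c / n_dims for c in counts]


# Made-up hidden states: 3 words, 4 dimensions each.
words = ["man", "plays", "instrument"]
hidden = [
    [0.9, 0.1, 0.2, 0.0],
    [0.3, 0.8, 0.1, 0.1],
    [0.2, 0.4, 0.7, 0.6],
]
importance = max_pool_importance(hidden)
for word, share in zip(words, importance):
    print(f"{word}: {share:.2f}")
```

In this toy input, "instrument" wins two of the four dimensions, so it receives half of the total importance.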
To reproduce our results and train our models on SNLI, set GLOVE_PATH in train_nli.py, then run:
python train_nli.py
You should obtain a dev accuracy of 85 and a test accuracy of 84.5 with the default setting.
To reproduce our results on transfer tasks, clone SentEval and set PATH_SENTEVAL, PATH_TRANSFER_TASKS in evaluate_model.py, then run:
python evaluate_model.py
Using our best model infersent.allnli.pickle, you should obtain the following test results:
Model | MR | CR | SUBJ | MPQA | STS14 | STS Benchmark | SICK Relatedness | SICK Entailment | SST | TREC | MRPC |
---|---|---|---|---|---|---|---|---|---|---|---|
InferSent | 81.1 | 86.3 | 92.4 | 90.2 | .68/.65 | 75.8/75.5 | 0.884 | 86.1 | 84.6 | 88.2 | 76.2/83.1 |
SkipThought | 79.4 | 83.1 | 93.7 | 89.3 | .44/.45 | 72.1/70.2 | 0.858 | 79.5 | 82.9 | 88.4 | - |
Note that while InferSent provides good features for many different tasks, our approach also obtains strong results on STS tasks, which evaluate the quality of the cosine metric in the embedding space.
Please cite [1] if you found this code useful.
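For readers unfamiliar with the cosine metric, the sketch below computes it for two embedding vectors. It uses only the standard library so that it runs without extra dependencies, and the short vectors are made-up stand-ins for 4096-dimensional sentence embeddings. The function name cosine_similarity is chosen here for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up stand-ins for sentence embeddings.
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]  # same direction as emb_a
emb_c = [0.0, 0.0, 3.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1: same direction
print(cosine_similarity(emb_a, emb_c))  # 0: no overlap
```

With NumPy arrays, the same quantity is np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)).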
[1] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
@article{conneau2017supervised,
title={Supervised Learning of Universal Sentence Representations from Natural Language Inference Data},
author={Conneau, Alexis and Kiela, Douwe and Schwenk, Holger and Barrault, Loic and Bordes, Antoine},
journal={arXiv preprint arXiv:1705.02364},
year={2017}
}
Contact: [email protected]