ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding
This repository contains the code and pre-trained models for our COLING 2022 paper ESimCSE: Enhanced Sample Building Method for Contrastive Learning of Unsupervised Sentence Embedding.
Thanks to SimCSE (Gao et al., EMNLP 2021)! Our work is mainly based on the SimCSE repo.
SimCSE adopts dropout as data augmentation and encodes an input sentence twice into two corresponding embeddings to build a positive pair. Since SimCSE is a Transformer-based encoder that directly encodes the length information of sentences through positional embeddings, the two embeddings in a positive pair contain the same length information. Thus, a model trained with these positive pairs is biased: it tends to consider sentences of the same or similar length to be more similar in semantics. To alleviate this, we apply a simple but effective repetition operation to modify the input sentence. We then pass the input sentence and its modified counterpart to the pre-trained Transformer encoder, respectively, to get the positive pair. Additionally, we draw inspiration from the computer vision community and introduce momentum contrast to enlarge the number of negative pairs without additional calculations. The proposed modifications are applied to positive and negative pairs separately and together form a new sentence embedding method, termed Enhanced SimCSE (ESimCSE). We evaluate the proposed ESimCSE on several benchmark datasets w.r.t. the semantic textual similarity (STS) task. Experimental results show that ESimCSE outperforms SimCSE by an average Spearman correlation of 2.02% on BERT-base.
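The repetition operation can be pictured with a short sketch. This is an illustration under stated assumptions, not the code in this repository: the function name `word_repetition`, the sampling rule (a random number of positions, capped by `dup_rate` times the sentence length with a floor of 2), and the value 0.32 are hypothetical choices made for the example. Section 3.1 of the paper gives the actual definition.

```python
import random


def word_repetition(tokens, dup_rate=0.32, rng=random):
    """Return a copy of `tokens` in which a random subset is repeated in place.

    Assumption: the number of repeated tokens is drawn uniformly from
    [0, max(2, int(dup_rate * len(tokens)))], clipped to the sentence length.
    """
    n = len(tokens)
    if n == 0:
        return []
    max_dup = min(n, max(2, int(dup_rate * n)))
    dup_len = rng.randint(0, max_dup)  # randint is inclusive on both ends
    dup_positions = set(rng.sample(range(n), dup_len))

    repeated = []
    for i, token in enumerate(tokens):
        repeated.append(token)
        if i in dup_positions:
            repeated.append(token)  # duplicate the token right after itself
    return repeated


sentence = "i like this apple because it looks so fresh".split()
print(" ".join(word_repetition(sentence, rng=random.Random(0))))
```

The original sentence and its repeated variant have different lengths but nearly the same meaning, which is the property the positive pair is meant to have.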
Our released models are listed below. You can import these models using HuggingFace's Transformers.
Model | Avg. STS |
---|---|
unsup-esimcse-bert-base-uncased | 78.27 |
unsup-esimcse-bert-large-uncased | 79.31 |
unsup-esimcse-roberta-base | 77.44 |
unsup-esimcse-roberta-large | 79.45 |
Besides using our provided sentence embedding tool, you can also easily import our models with HuggingFace's transformers:
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer
# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("ffgcc/esimcse-bert-base-uncased")
model = AutoModel.from_pretrained("ffgcc/esimcse-bert-base-uncased")
# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
If you encounter any problem when loading the models directly through HuggingFace's API, you can also download the models manually from the above table and use model = AutoModel.from_pretrained({PATH TO THE DOWNLOAD MODEL}).
In the following section, we describe how to train an ESimCSE model using our code.
First, install PyTorch by following the instructions from the official website. To faithfully reproduce our results, please use the 1.7.1 version that matches your platform and CUDA version. PyTorch versions higher than 1.7.1 should also work. For example, if you use Linux and CUDA 11 (how to check CUDA version), install PyTorch with the following command,
pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
If you instead use CUDA <11 or CPU, install PyTorch with the following command,
pip install torch==1.7.1
Then run the following script to install the remaining dependencies,
pip install -r requirements.txt
Our evaluation code for sentence embeddings is based on a modified version of SentEval. It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation takes the "all" setting and reports Spearman's correlation.
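As a reminder of what the reported number measures, the sketch below computes Spearman's correlation between predicted similarities and gold scores in plain Python. It is only an illustration: the helper names and the toy scores are hypothetical, and the evaluation code itself relies on SentEval rather than on this snippet.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: cosine similarities from a model vs. human similarity ratings.
predicted = [0.9, 0.2, 0.5, 0.7]
gold = [4.8, 1.0, 3.9, 3.0]
print(round(spearman(predicted, gold), 4))
```

Only the ordering of the predicted similarities matters here, so any monotonic rescaling of the cosine scores leaves the result unchanged.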
Before evaluation, please download the evaluation datasets by running
cd SentEval/data/downstream/
bash download_dataset.sh
Then return to the root directory. You can evaluate any transformers-based pre-trained model using our evaluation code. For example,
python evaluation.py \
--model_name_or_path ffgcc/esimcse-bert-base-uncased \
--pooler cls_before_pooler \
--task_set sts \
--mode test
which is expected to output the results in a tabular format:
------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 73.40 | 83.27 | 77.25 | 82.66 | 78.81 | 80.17 | 72.30 | 78.27 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
Arguments for the evaluation script are as follows,
- --model_name_or_path: The name or path of a transformers-based pre-trained checkpoint. You can directly use the models in the above table, e.g., ffgcc/esimcse-bert-base-uncased.
- --pooler: Pooling method. Now we support
  - cls (default): Use the representation of the [CLS] token. A linear+activation layer is applied after the representation (it's in the standard BERT implementation).
  - cls_before_pooler: Use the representation of the [CLS] token without the extra linear+activation.
  - avg: Average embeddings of the last layer. If you use checkpoints of SBERT/SRoBERTa (paper), you should use this option.
  - avg_top2: Average embeddings of the last two layers.
  - avg_first_last: Average embeddings of the first and last layers. If you use vanilla BERT or RoBERTa, this works the best.
- --mode: Evaluation mode
  - test (default): The default test mode. To faithfully reproduce our results, you should use this option.
  - dev: Report the development set results. Note that in STS tasks, only STS-B and SICK-R have development sets, so we only report their numbers. It also takes a fast mode for transfer tasks, so the running time is much shorter than the test mode (though numbers are slightly lower).
  - fasttest: The same as test, but with a fast mode, so the running time is much shorter and the reported numbers may be lower (only for transfer tasks).
- --task_set: What set of tasks to evaluate on (if set, it will override --tasks)
  - sts (default): Evaluate on STS tasks, including STS 12~16, STS-B and SICK-R. This is the most commonly-used set of tasks to evaluate the quality of sentence embeddings.
  - transfer: Evaluate on transfer tasks.
  - full: Evaluate on both STS and transfer tasks.
  - na: Manually set tasks by --tasks.
- --tasks: Specify which dataset(s) to evaluate on. Will be overridden if --task_set is not na. See the code for a full list of tasks.
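The pooling options listed above differ only in which hidden states are combined. The sketch below illustrates cls_before_pooler, avg, avg_top2 and avg_first_last for a single sentence, using plain Python lists in place of tensors. It is not the repository's pooler: the function names are hypothetical, the cls option is left out because it needs the model's trained linear layer, and treating the first element of `hidden_states` as the "first" layer is an assumption.

```python
def masked_mean(layer, attention_mask):
    """Average the token vectors of one layer, skipping padded positions."""
    kept = [vec for vec, keep in zip(layer, attention_mask) if keep]
    dim = len(layer[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def pool(hidden_states, attention_mask, pooler="cls_before_pooler"):
    """Pool one sentence.

    hidden_states: list of layers, each a list of token vectors [seq_len][dim].
    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    last = hidden_states[-1]
    if pooler == "cls_before_pooler":
        return list(last[0])  # [CLS] is the first token of the last layer
    if pooler == "avg":
        return masked_mean(last, attention_mask)
    if pooler == "avg_top2":
        pair = (hidden_states[-2], last)
    elif pooler == "avg_first_last":
        pair = (hidden_states[0], last)  # assumption: index 0 is "first"
    else:
        raise ValueError("unsupported pooler: %s" % pooler)
    # Average the two layers token by token, then average over real tokens.
    mixed = [[(a + b) / 2.0 for a, b in zip(va, vb)] for va, vb in zip(*pair)]
    return masked_mean(mixed, attention_mask)


# Three layers, three tokens (the last one is padding), two dimensions.
layers = [
    [[0.0, 0.0], [2.0, 2.0], [9.0, 9.0]],
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],
    [[2.0, 4.0], [4.0, 8.0], [9.0, 9.0]],
]
mask = [1, 1, 0]
for name in ("cls_before_pooler", "avg", "avg_top2", "avg_first_last"):
    print(name, pool(layers, mask, name))
```

Because the attention mask is applied before averaging, the padded third token never contributes to the sentence vector.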
Data
Following SimCSE, we sample 1 million sentences from English Wikipedia. You can run data/download_wiki.sh to download the data.
Training scripts
We provide example training scripts for ESimCSE. In run.sh, we provide a single-GPU (or CPU) example.
We explain the arguments below:
- --train_file: Training file path. We support "txt" files (one sentence per line). You can use Wikipedia or your own data in the same format.
- --model_name_or_path: Pre-trained checkpoints to start with. For now we support BERT-based models (bert-base-uncased, bert-large-uncased, etc.) and RoBERTa-based models (roberta-base, roberta-large, etc.).
- --temp: Temperature for the contrastive loss.
- --pooler_type: Pooling method. It's the same as --pooler in the evaluation part.
- --hard_negative_weight: If using hard negatives (i.e., there are 3 columns in the training file), this is the logarithm of the weight. For example, if the weight is 1, then this argument should be set as 0 (default value).
- --do_mlm: Whether to use the MLM auxiliary objective. If True:
  - --mlm_weight: Weight for the MLM objective.
  - --mlm_probability: Masking rate for the MLM objective.
- --neg_size: The number of negative sentences.
- --dup_type: The type of repetition; see Section 5.2 of our paper for details.
- --dup_rate: The rate of repetition; see Section 3.1 of our paper for details.
- --momentum: The rate of momentum; see Section 3.2 of our paper for details.
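To make the roles of --temp and --hard_negative_weight concrete, here is a minimal sketch of an InfoNCE-style contrastive loss for a single anchor sentence. It assumes that each logit is a cosine similarity divided by the temperature and that the hard-negative weight is added to the hard negative's logit, which is one way to read "the logarithm of the weight". The function names and the value 0.05 are hypothetical, and this is not the training code in this repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temp=0.05,
                     hard_negative=None, hard_negative_weight=0.0):
    """Cross-entropy over similarity logits, with the positive as the target."""
    logits = [cosine(anchor, positive) / temp]
    logits += [cosine(anchor, neg) / temp for neg in negatives]
    if hard_negative is not None:
        # Adding log(w) to a logit multiplies its exp() term by w.
        logits.append(cosine(anchor, hard_negative) / temp + hard_negative_weight)
    # Numerically stable log-sum-exp of all logits.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]


anchor = [1.0, 0.0]
positive = [1.0, 0.0]
negatives = [[0.0, 1.0], [0.0, -1.0]]
print(contrastive_loss(anchor, positive, negatives, temp=0.05))
print(contrastive_loss(anchor, positive, negatives, temp=1.0))
```

A smaller temperature sharpens the softmax, so a well-separated positive yields a loss close to zero, while a hard-negative weight of 0 corresponds to a plain weight of 1.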
All the other arguments are standard Huggingface transformers training arguments. Some of the often-used arguments are: --output_dir, --learning_rate, --per_device_train_batch_size. Our example scripts also evaluate the model on the STS-B development set (you need to download the dataset following the evaluation section) and save the best checkpoint.
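The --momentum and --neg_size arguments relate to the momentum contrast component. The sketch below shows the general idea in plain Python: a momentum encoder whose parameters follow the trained encoder through an exponential moving average, and a fixed-size queue that keeps embeddings from recent batches as extra negatives. It is a generic MoCo-style illustration, not the code in this repository; the names `momentum_update` and `NegativeQueue` and the value 0.995 are hypothetical.

```python
from collections import deque


def momentum_update(momentum_params, online_params, momentum=0.995):
    """Move each momentum-encoder parameter slightly toward the online one."""
    return [
        momentum * m + (1.0 - momentum) * o
        for m, o in zip(momentum_params, online_params)
    ]


class NegativeQueue:
    """First-in, first-out store of sentence embeddings used as negatives."""

    def __init__(self, neg_size):
        # A deque with maxlen drops the oldest items once it is full.
        self.items = deque(maxlen=neg_size)

    def enqueue(self, embeddings):
        self.items.extend(embeddings)

    def negatives(self):
        return list(self.items)


queue = NegativeQueue(neg_size=3)
queue.enqueue([[0.1, 0.2], [0.3, 0.4]])  # embeddings from one batch
queue.enqueue([[0.5, 0.6], [0.7, 0.8]])  # the oldest embedding is evicted
print(queue.negatives())
print(momentum_update([1.0, 0.0], [0.0, 1.0], momentum=0.9))
```

Because the queued embeddings were already computed for earlier batches, reusing them as negatives adds no extra forward passes.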
For results in the paper, we use Nvidia 3090 GPUs with CUDA 11. Using different types of devices or different versions of CUDA or other software may lead to slightly different performance.
Convert models
Our saved checkpoints are slightly different from Huggingface's pre-trained checkpoints. Run python simcse_to_huggingface.py --path {PATH_TO_CHECKPOINT_FOLDER} to convert them. After that, you can evaluate the converted model with our evaluation code or use it directly out of the box.
If you have any questions related to the code or the paper, feel free to email Wuxing ([email protected]) and Chaochen ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and faster!
Please cite our paper if you use ESimCSE in your work:
@article{wu2021esimcse,
title={Esimcse: Enhanced sample building method for contrastive learning of unsupervised sentence embedding},
author={Wu, Xing and Gao, Chaochen and Zang, Liangjun and Han, Jizhong and Wang, Zhongyuan and Hu, Songlin},
journal={arXiv preprint arXiv:2109.04380},
year={2021}
}