
tBERT


This repository provides code for the paper "tBERT: Topic Models and BERT Joining Forces for Semantic Similarity Detection" (https://www.aclweb.org/anthology/2020.acl-main.630/).

Setup

Download pretrained BERT

  • Download the pretrained BERT-Base (uncased) model and unzip it:
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
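
After unzipping, the folder should contain the standard files from Google's BERT-Base release. A minimal sanity check (the expected file list comes from the standard release, not from this repository):

import os

# Standard contents of the uncased_L-12_H-768_A-12 release.
expected = ["bert_config.json", "vocab.txt", "bert_model.ckpt.index"]
for name in expected:
    path = os.path.join("uncased_L-12_H-768_A-12", name)
    print(path, "OK" if os.path.exists(path) else "MISSING")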

Download preprocessed data

  • Go to the tBERT repository:
cd /path/to/tBERT/
  • Download the topic models and original datasets from Dropbox, then extract the archive:
wget "https://www.dropbox.com/s/6p26mkwv62677zt/original_data.tar.gz"
tar -xzvf original_data.tar.gz
  • Your tBERT directory should now have the following content (the Semeval folder is empty because data will be automatically downloaded by the script); a quick layout check is sketched after the tree:
.
├── data
│   ├── cache
│   ├── logs
│   ├── models
│   ├── MSRP
│   │   └── MSRParaphraseCorpus
│   │       ├── msr-para-test.tsv
│   │       ├── msr-para-train.tsv
│   │       └── msr-para-val.tsv
│   ├── Quora
│   │   └── Quora_question_pair_partition
│   │       ├── dev.tsv
│   │       ├── test.tsv
│   │       └── train.tsv
│   ├── Semeval
│   └── topic_models
│       ├── basic
│       │   ├── MSRP_alpha1_80
│       │   ├── Quora_alpha1_90
│       │   ├── Semeval_alpha10_70
│       │   ├── Semeval_alpha10_80
│       │   └── Semeval_alpha50_70
│       ├── basic_gsdmm
│       │   ├── MSRP_alpha0.1_80
│       │   ├── Quora_alpha0.1_90
│       │   ├── Semeval_alpha0.1_70
│       │   └── Semeval_alpha0.1_80
│       └── mallet-2.0.8.zip
└── src
    ├── evaluation
    │   ├── difficulty
    │   └── metrics
    ├── experiments
    ├── loaders
    │   ├── MSRP
    │   ├── PAWS
    │   ├── Quora
    │   └── Semeval
    ├── logs
    ├── models
    │   ├── forward
    │   └── helpers
    ├── preprocessing
    ├── ShortTextTopic
    │   ├── gsdmm
    │   └── test
    └── topic_model
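
To verify the layout, here is a minimal sketch that checks a few of the expected paths (the path list is inferred from the tree above, not taken from the repository's code):

import os

# A few paths the training scripts expect, inferred from the directory tree above.
required = [
    "data/MSRP/MSRParaphraseCorpus/msr-para-train.tsv",
    "data/Quora/Quora_question_pair_partition/train.tsv",
    "data/topic_models/basic",
    "data/topic_models/basic_gsdmm",
]
for path in required:
    print(path, "OK" if os.path.exists(path) else "MISSING")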

Requirements

  • This code has been tested with Python 3.6 and TensorFlow 1.11.
  • Install the required Python packages as defined in requirements.txt:
pip install -r requirements.txt
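
A quick environment check (a sketch that only verifies the two tested versions above):

import sys
import tensorflow as tf

# The repository is tested with Python 3.6 and TensorFlow 1.11 (see above).
print("Python:", sys.version.split()[0])
print("TensorFlow:", tf.__version__)
assert sys.version_info[:2] == (3, 6), "expected Python 3.6"
assert tf.__version__.startswith("1.11"), "expected TensorFlow 1.11"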

Usage

  • To test your setup, train tBERT for one epoch on a small portion of the MSRP data:
python src/models/base_model_bert.py
  • This should produce output similar to the following:
['m_train_B', 'm_dev_B', 'm_test_B']
['data/MSRP/m_train_B.txt', 'data/MSRP/m_dev_B.txt', 'data/MSRP/m_test_B.txt']
data/cache/m_train_B.pickle
Loading cached input for m_train_B
data/cache/m_dev_B.pickle
Loading cached input for m_dev_B
data/cache/m_test_B.pickle
Loading cached input for m_test_B
Mapping words to BERT ids...
Finished word id mapping.
Done.
{'topic_type': 'ldamallet', 'load_ids': True, 'topic': 'doc', 'minibatch_size': 10, 'seed': 1, 'max_m': 10, 'bert_large': False, 'num_topics': 80, 'num_epochs': 1, 'model': 'bert_simple_topic', 'max_length': 'minimum', 'simple_padding': True, 'padding': False, 'bert_update': True, 'L2': 0, 'dropout': 0.1, 'bert_cased': False, 'speedup_new_layers': False, 'unk_topic': 'zero', 'stopping_criterion': 'F1', 'tasks': ['B'], 'learning_rate': 0.3, 'hidden_layer': 1, 'gpu': -1, 'optimizer': 'Adadelta', 'datapath': 'data/', 'unflat_topics': False, 'sparse_labels': True, 'freeze_thaw_tune': False, 'dataset': 'MSRP', 'topic_alpha': 1, 'predict_every_epoch': False, 'unk_sub': False, 'subsets': ['train', 'dev', 'test'], 'topic_update': True}
Topic scope: doc
input ids shape: (?, ?)
Loading pretrained model from https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1
---
Model: tBERT
---
D_T1 shape: (?, 80)
D_T2 shape: (?, 80)
pooled BERT shape: (?, 768)
combined shape: (?, 928)
hidden 1 shape: (?, 464)
output layer shape: (?, 2)
reading logs...
No file found at data/logs/test.json. Creating new log.
get new id
Model 0
2020-07-03 19:02:31.946114: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-03 19:02:37.224331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 6a05:00:00.0
totalMemory: 11.17GiB freeMemory: 11.11GiB
2020-07-03 19:02:37.224390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
logfile: test.json
Finetune...
Epoch 1
Dev F1 after epoch 1: 0.75
data/models/model_0/model_epoch1.ckpt
Maximum number of epochs reached during early stopping.
Finished training.
Load best model from epoch 1
reading logs...
Finished training after 0.55 min
Dev F1: 0.75
Test F1: 0.75
reading logs...
Wrote predictions for model_0.
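
The shape printout above reflects how tBERT combines its inputs: the pooled 768-dimensional BERT vector is concatenated with the two 80-dimensional document-topic distributions (768 + 80 + 80 = 928), passed through one hidden layer of half that size, and projected onto two output classes. A minimal TF 1.x sketch of that combination (the placeholder names and the ReLU activation are illustrative assumptions, not the repository's actual code):

import tensorflow as tf  # TF 1.x graph-mode API, matching the tested version

# Illustrative placeholders for the model inputs.
pooled_bert = tf.placeholder(tf.float32, [None, 768])  # pooled BERT shape: (?, 768)
d_t1 = tf.placeholder(tf.float32, [None, 80])          # D_T1 shape: (?, 80)
d_t2 = tf.placeholder(tf.float32, [None, 80])          # D_T2 shape: (?, 80)

combined = tf.concat([pooled_bert, d_t1, d_t2], axis=-1)        # combined shape: (?, 928)
hidden = tf.layers.dense(combined, 464, activation=tf.nn.relu)  # hidden 1 shape: (?, 464)
logits = tf.layers.dense(hidden, 2)                             # output layer shape: (?, 2)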
  • The model will be saved under data/models/model_0/ and the training log under data/logs/test.json.
  • You can also run an experiment on the complete dataset and set different command-line flags, e.g.:
python src/experiments/tbert.py -dataset MSRP -layers 1 -topic doc -topic_type ldamallet -learning_rate 5e-5 --early_stopping -seed 3 -gpu 0
  • This should give you output similar to the following:
Starting experiment 1 of 1
tbert_1_seed_early_stopping.json
{'dropout': 0.1, 'model': 'bert_simple_topic', 'bert_cased': False, 'max_m': None, 'tasks': ['B'], 'padding': False, 'dataset': 'MSRP', 'L2': 0, 'subsets': ['train', 'dev', 'test'], 'unk_sub': False, 'hidden_layer': 1, 'datapath': 'data/', 'predict_every_epoch': False, 'num_epochs': 3, 'simple_padding': True, 'patience': 2, 'speedup_new_layers': False, 'minibatch_size': 32, 'max_length': 'minimum', 'load_ids': True, 'topic_type': 'ldamallet', 'unk_topic': 'uniform', 'topic_update': False, 'sparse_labels': True, 'num_topics': 80, 'topic_alpha': 1, 'seed': 1, 'gpu': 0, 'stopping_criterion': 'F1', 'bert_update': True, 'learning_rate': 3e-05, 'optimizer': 'Adam', 'topic': 'doc'}
['m_train_B', 'm_dev_B', 'm_test_B']
['data/MSRP/m_train_B.txt', 'data/MSRP/m_dev_B.txt', 'data/MSRP/m_test_B.txt']
data/cache/m_train_B.pickle
Loading cached input for m_train_B
data/cache/m_dev_B.pickle
Loading cached input for m_dev_B
data/cache/m_test_B.pickle
Loading cached input for m_test_B
Mapping words to BERT ids...
Finished word id mapping.
Done.
{'dropout': 0.1, 'model': 'bert_simple_topic', 'bert_cased': False, 'max_m': None, 'tasks': ['B'], 'padding': False, 'dataset': 'MSRP', 'unflat_topics': False, 'L2': 0, 'subsets': ['train', 'dev', 'test'], 'unk_sub': False, 'hidden_layer': 1, 'datapath': 'data/', 'bert_large': False, 'predict_every_epoch': False, 'num_epochs': 3, 'simple_padding': True, 'patience': 2, 'speedup_new_layers': False, 'minibatch_size': 32, 'max_length': 'minimum', 'load_ids': True, 'topic_type': 'ldamallet', 'unk_topic': 'uniform', 'topic_update': False, 'sparse_labels': True, 'num_topics': 80, 'topic_alpha': 1, 'seed': 1, 'gpu': 0, 'stopping_criterion': 'F1', 'bert_update': True, 'learning_rate': 3e-05, 'optimizer': 'Adam', 'topic': 'doc'}
Running on GPU: 0
Topic scope: doc
input ids shape: (?, ?)
Loading pretrained model from https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1
---
Model: tBERT
---
D_T1 shape: (?, 80)
D_T2 shape: (?, 80)
pooled BERT shape: (?, 768)
combined shape: (?, 928)
hidden 1 shape: (?, 464)
output layer shape: (?, 2)
reading logs...
get new id
Model 1
2020-07-03 19:51:34.629485: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-03 19:51:39.501180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 6a05:00:00.0
totalMemory: 11.17GiB freeMemory: 11.11GiB
2020-07-03 19:51:39.501233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2020-07-03 19:51:39.769183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-03 19:51:39.769242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2020-07-03 19:51:39.769263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2020-07-03 19:51:39.769368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10761 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 6a05:00:00.0, compute capability: 3.7)
logfile: tbert_1_seed_early_stopping.json
Finetune...
Epoch 1
Dev F1 after epoch 1: 0.8917378783226013
data/models/model_1/model_epoch1.ckpt
Epoch 2
Dev F1 after epoch 2: 0.9058663249015808
data/models/model_1/model_epoch2.ckpt
Epoch 3
Dev F1 after epoch 3: 0.8959999680519104
Maximum number of epochs reached during early stopping.
Finished training.
Load best model from epoch 2
reading logs...
Finished training after 10.74 min
Dev F1: 0.9059
Test F1: 0.8841
reading logs...
Wrote predictions for model_1.
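
Both runs append their configuration and scores to the JSON log named in the output (test.json and tbert_1_seed_early_stopping.json above). A minimal sketch for inspecting such a log (the schema depends on the logging code, so the sketch only pretty-prints it):

import json

# Logs are written under data/logs/ (see the output above).
with open("data/logs/tbert_1_seed_early_stopping.json") as f:
    log = json.load(f)

# Pretty-print the beginning of the log without assuming its schema.
print(json.dumps(log, indent=2)[:1000])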
  • Run the LDA predictions:
cd src/LDA
python LDA.py
  • The output can be viewed in 10.html.
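
For orientation, a purely illustrative sketch of an LDA-to-HTML pipeline of this shape (gensim and pyLDAvis are assumptions; the actual LDA.py may differ):

# Purely illustrative: train a toy gensim LDA model and save a pyLDAvis visualization.
from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis >= 3.0

texts = [
    ["topic", "models", "semantic", "similarity"],
    ["bert", "sentence", "pair", "classification"],
    ["paraphrase", "detection", "msrp", "quora"],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus=corpus, num_topics=10, id2word=dictionary, random_state=1)
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "10.html")  # the HTML file referenced in the step above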
    
