This repo contains code for SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation, which was published in EMNLP 2022. SpaBERT provides a general-purpose geo-entity representation based on neighboring entities in geospatial data. It extends BERT to capture linearized spatial context and adds a spatial coordinate embedding mechanism that preserves the spatial relations of entities in two-dimensional space. SpaBERT is pretrained with masked language modeling and masked entity prediction tasks to learn spatial dependencies.
- Slides: emnlp22-spabert.pdf
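The "linearized spatial context" described above can be pictured with a small sketch. This is an illustration only, not the repository's implementation: the `linearize` helper, the dictionary layout, and the example place names and coordinates are all hypothetical, and the `sep_between_neighbors` argument is merely named after the command-line flag used below (its exact behavior in the repo is an assumption). The sketch assumes neighbors are ordered by ascending distance to the target entity and that each entity keeps its x/y offset from the target for the coordinate embedding.

```python
import math

def linearize(pivot, neighbors, sep_between_neighbors=True):
    """Flatten a target entity and its neighbors into a pseudo-sentence.

    Hypothetical helper: returns the joined entity names plus one
    (dx, dy) offset per entity, measured from the target entity.
    """
    px, py = pivot["xy"]
    # Assumption: nearer neighbors come first in the sequence.
    ranked = sorted(
        neighbors,
        key=lambda n: math.hypot(n["xy"][0] - px, n["xy"][1] - py),
    )
    names = [pivot["name"]] + [n["name"] for n in ranked]
    # Offsets that a coordinate embedding could consume; the target is at (0, 0).
    offsets = [(0.0, 0.0)] + [(n["xy"][0] - px, n["xy"][1] - py) for n in ranked]
    joiner = " [SEP] " if sep_between_neighbors else " "
    return joiner.join(names), offsets

# Made-up example data in arbitrary planar units.
pivot = {"name": "Union Station", "xy": (0.0, 0.0)}
neighbors = [
    {"name": "City Hall", "xy": (3.0, 4.0)},
    {"name": "Olvera Street", "xy": (1.0, 0.0)},
]
sentence, offsets = linearize(pivot, neighbors)
print(sentence)
print(offsets)
```

The resulting string can be tokenized like ordinary text, while the offsets carry the geometry that plain token positions would lose.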
Pretrained model weights can be downloaded from Google Drive for SpaBERT-base and SpaBERT-large.
Weights can also be obtained by training from scratch using the following sample code. Data for pretraining can be downloaded here.
- Code to pretrain SpaBERT-base model:
python3 train_mlm.py --lr=5e-5 --sep_between_neighbors --bert_option='bert-base'
- Code to pretrain SpaBERT-large model:
python3 train_mlm.py --lr=1e-6 --sep_between_neighbors --bert_option='bert-large'
The goal is to predict a geo-entity's semantic type (e.g., transportation or healthcare) given the target geo-entity's name and its spatial context (i.e., the names and locations of surrounding neighbors).
Models trained on OSM data in the London and California regions can be downloaded from Google Drive for SpaBERT-base and SpaBERT-large.
Data used for training and testing can be downloaded here.
- Sample code for training SpaBERT-base typing model
python3 train_cls_spatialbert.py --lr=5e-5 --sep_between_neighbors --bert_option='bert-base' --with_type --mlm_checkpoint_path='mlm_mem_keeppos_ep0_iter06000_0.2936.pth'
- Sample code for training SpaBERT-large typing model
python3 train_cls_spatialbert.py --lr=1e-6 --sep_between_neighbors --bert_option='bert-large' --with_type --mlm_checkpoint_path='mlm_mem_keeppos_ep1_iter02000_0.4400.pth' --epochs=20
Geo-entity linking links geo-entities from a geographic information system (GIS) oriented dataset to a knowledge base (KB). This task is unsupervised and thus does not require any further training; pretrained models can be used for it directly.
- Linking with SpaBERT-base
python3 unsupervised_wiki_location_allcand.py --model_name='spatial_bert-base' --sep_between_neighbors \
--spatial_bert_weight_dir='weights/' --spatial_bert_weight_name='mlm_mem_keeppos_ep0_iter06000_0.2936.pth'
- Linking with SpaBERT-large
python3 unsupervised_wiki_location_allcand.py --model_name='spatial_bert-large' --sep_between_neighbors \
--spatial_bert_weight_dir='weights/' --spatial_bert_weight_name='mlm_mem_keeppos_ep1_iter02000_0.4400.pth'
Data used for linking from USGS historical maps to the WikiData KB is provided here.
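Because linking needs no further training, one plausible way to picture it is as nearest-neighbor search over embeddings. The sketch below is a rough illustration, not the logic of `unsupervised_wiki_location_allcand.py`: it assumes that each map entity and each KB candidate has already been encoded into a vector, and that candidates are ranked by cosine similarity. The function names, candidate IDs, and toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_candidates(query_vec, candidates):
    """Return (candidate_id, score) pairs sorted from most to least similar."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors standing in for real entity embeddings.
query = [1.0, 0.0, 1.0]
candidates = {
    "Q_A": [1.0, 0.1, 0.9],    # hypothetical KB entry close to the query
    "Q_B": [0.0, 1.0, 0.0],    # orthogonal to the query
    "Q_C": [-1.0, 0.0, -1.0],  # opposite direction
}
ranking = rank_candidates(query, candidates)
best = ranking[0][0]
print(best, ranking)
```

In this picture, the top-ranked candidate is taken as the linked KB entry, so linking quality depends entirely on how well the pretrained representations separate entities.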
@article{li2022spabert,
title={SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation},
author={Li, Zekun and Kim, Jina and Chiang, Yao-Yi and Chen, Muhao},
journal={EMNLP},
year={2022}
}