DGL - Knowledge Graph Embedding

Introduction

DGL-KE is a DGL-based package for efficiently computing node embeddings and relation embeddings of knowledge graphs. This package is adapted from KnowledgeGraphEmbedding. We enable fast and scalable training of knowledge graph embeddings, while keeping the package as extensible as KnowledgeGraphEmbedding. On a single machine, training takes only a few minutes for medium-sized knowledge graphs, such as FB15k and wn18, and a couple of hours on Freebase, which has hundreds of millions of edges.

DGL-KE includes the following knowledge graph embedding models:

  • TransE (TransE_l1 with L1 distance and TransE_l2 with L2 distance)
  • DistMult
  • ComplEx
  • RESCAL
  • TransR
  • RotatE

We will add other popular models in the future.
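For a sense of what these models compute, the two TransE variants score a triple (head, rel, tail) as gamma minus the L1 or L2 distance between head + rel and tail. A minimal sketch in PyTorch (the function name and toy dimensions are ours, not DGL-KE's API):

import torch

def transe_score(head, rel, tail, dist="l2", gamma=12.0):
    # Score = gamma - ||head + rel - tail||; higher means more plausible.
    diff = head + rel - tail
    if dist == "l1":                      # TransE_l1
        d = diff.abs().sum(dim=-1)
    else:                                 # TransE_l2
        d = diff.norm(p=2, dim=-1)
    return gamma - d

# Toy usage: score a batch of 4 random triples with 400-dim embeddings.
h, r, t = (torch.randn(4, 400) for _ in range(3))
print(transe_score(h, r, t, dist="l1"))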

DGL-KE supports multiple training modes:

  • CPU training
  • GPU training
  • Joint CPU & GPU training
  • Multiprocessing training on CPUs

For joint CPU & GPU training, node embeddings are stored on the CPU and mini-batches are trained on the GPU. This mode is designed for training KGE models on large knowledge graphs.
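Roughly, the pattern looks like this (a sketch of the idea only, not DGL-KE's internal code; train_step and loss_fn are hypothetical names, and a CUDA device is assumed):

import torch

# The full embedding table stays in CPU memory; only the rows a
# mini-batch touches are copied to the GPU.
num_nodes, dim = 1_000_000, 400
emb = torch.randn(num_nodes, dim)

def train_step(node_ids, loss_fn, lr=0.1):
    batch = emb[node_ids].to("cuda")   # gather + host-to-device copy
    batch.requires_grad_()
    loss = loss_fn(batch)              # forward/backward run on the GPU
    loss.backward()
    with torch.no_grad():              # sparse update back into the CPU table
        emb[node_ids] -= lr * batch.grad.cpu()

# Toy usage: push a batch of 1024 node embeddings toward zero.
train_step(torch.randint(0, num_nodes, (1024,)), lambda b: (b ** 2).sum())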

For multiprocessing training, each process trains mini-batches independently and uses shared memory for communication between processes. This mode is designed for training KGE models on large knowledge graphs with many CPU cores.
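A minimal sketch of that pattern with PyTorch shared memory (the worker logic below is a made-up stand-in, not the actual trainer):

import torch
import torch.multiprocessing as mp

def worker(rank, emb):
    # Each process samples its own mini-batches and updates the shared
    # table in place; no explicit message passing is needed.
    for _ in range(100):
        ids = torch.randint(0, emb.size(0), (1024,))
        grad = torch.randn(1024, emb.size(1))   # stand-in for a real gradient
        emb[ids] -= 0.1 * grad

if __name__ == "__main__":
    emb = torch.randn(100_000, 400)
    emb.share_memory_()                 # visible to all worker processes
    procs = [mp.Process(target=worker, args=(r, emb)) for r in range(8)]
    for p in procs: p.start()
    for p in procs: p.join()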

We will support multi-GPU training and distributed training in the near future.

Requirements

The package can run with both PyTorch and MXNet. For PyTorch, it works with v1.2 or newer. For MXNet, it works with v1.5 or newer.
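A quick way to check the backend version before running the scripts:

# PyTorch backend: the printed version should be 1.2.0 or newer.
import torch
print(torch.__version__)

# MXNet backend: the printed version should be 1.5.0 or newer.
# import mxnet
# print(mxnet.__version__)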

Built-in Datasets

DGL-KE provides five built-in knowledge graphs:

Dataset     #nodes      #edges       #relations
FB15k       14951       592213       1345
FB15k-237   14541       310116       237
wn18        40943       151442       18
wn18rr      40943       93003        11
Freebase    86054151    338586276    14824

Users can specify one of the datasets with --dataset in train.py and eval.py.
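For example (hyperparameters below are illustrative; see Command line parameters for tuned settings):

python3 train.py --model TransE_l2 --dataset wn18 --batch_size 1024 \
    --neg_sample_size 256 --hidden_dim 500 --gamma 12.0 --lr 0.1 --max_step 20000 \
    --batch_size_eval 16 --gpu 0 --valid --test -adv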

Performance

The speed is measured with 16 CPU cores and one Nvidia V100 GPU.

The speed on FB15k

Models     TransE_l1  TransE_l2  DistMult  ComplEx  RESCAL  TransR  RotatE
MAX_STEPS  20000      30000      100000    100000   30000   100000  100000
TIME       411s       329s       690s      806s     1800s   7627s   4327s

The accuracy on FB15k

Models     MR      MRR    HITS@1  HITS@3  HITS@10
TransE_l1  69.12   0.656  0.567   0.718   0.802
TransE_l2  35.86   0.570  0.400   0.708   0.834
DistMult   43.35   0.783  0.713   0.837   0.897
ComplEx    51.99   0.785  0.720   0.832   0.889
RESCAL     130.89  0.668  0.597   0.720   0.800
TransR     138.7   0.501  0.274   0.704   0.801
RotatE     39.6    0.725  0.628   0.802   0.875
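MR, MRR and HITS@k above are all derived from the rank of the true entity among all candidates. A small sketch of the definitions (assuming the per-triple ranks have already been computed):

import numpy as np

def summarize(ranks):
    # ranks: 1-based rank of the true head/tail for each test triple.
    ranks = np.asarray(ranks, dtype=np.float64)
    return {
        "MR":      ranks.mean(),           # mean rank (lower is better)
        "MRR":     (1.0 / ranks).mean(),   # mean reciprocal rank (higher is better)
        "HITS@1":  (ranks <= 1).mean(),    # fraction ranked in the top k
        "HITS@3":  (ranks <= 3).mean(),
        "HITS@10": (ranks <= 10).mean(),
    }

print(summarize([1, 3, 12, 2, 7]))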

In comparison, GraphVite uses 4 GPUs and takes 14 minutes. Thus, DGL-KE trains TransE on FB15k twice as fast as GraphVite while using far fewer resources. More performance information on GraphVite can be found here.

The speed on wn18

Models     TransE_l1  TransE_l2  DistMult  ComplEx  RESCAL  TransR  RotatE
MAX_STEPS  40000      20000      10000     20000    20000   20000   20000
TIME       719s       254s       126s      266s     333s    1547s   786s

The accuracy on wn18

Models     MR      MRR    HITS@1  HITS@3  HITS@10
TransE_l1  321.35  0.760  0.652   0.850   0.940
TransE_l2  181.57  0.570  0.322   0.802   0.944
DistMult   271.09  0.769  0.639   0.892   0.949
ComplEx    276.37  0.935  0.916   0.950   0.960
RESCAL     579.54  0.846  0.791   0.898   0.931
TransR     615.56  0.606  0.378   0.826   0.890
RotatE     367.64  0.931  0.924   0.935   0.944

The speed on Freebase

Models     DistMult  ComplEx
MAX_STEPS  3200000   3200000
TIME       2.44h     2.94h

The accuracy on Freebase (evaluated with 100,000 negative edges sampled for each positive edge).

Models    MR      MRR    HITS@1  HITS@3  HITS@10
DistMult  6159.1  0.716  0.690   0.729   0.760
ComplEx   6888.8  0.716  0.697   0.728   0.760

The configuration for reproducing the performance results can be found here.

Usage

DGL-KE doesn't require installation. The package contains two scripts: train.py and eval.py.

  • train.py trains knowledge graph embeddings and outputs the trained node embeddings and relation embeddings.

  • eval.py reads the pre-trained node embeddings and relation embeddings and evaluates how accurately they predict the tail node given (head, rel, ?) and the head node given (?, rel, tail).
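Conceptually, predicting the tail for (head, rel, ?) means scoring every candidate entity and ranking the true tail. A minimal sketch with a DistMult-style score (illustrative only; tail_rank is not part of eval.py):

import torch

def tail_rank(head_emb, rel_emb, all_entity_emb, true_tail):
    # DistMult scores (h, r, t) as sum(h * r * t); score every candidate tail.
    scores = (head_emb * rel_emb) @ all_entity_emb.t()
    # Rank of the true tail = 1 + number of candidates scored higher.
    return int((scores > scores[true_tail]).sum()) + 1

E = torch.randn(14951, 400)   # entity table (FB15k-sized, toy values)
R = torch.randn(1345, 400)    # relation table
print(tail_rank(E[0], R[5], E, true_tail=42))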

Input formats:

DGL-KE supports two knowledge graph input formats for user-defined datasets.

Format 1:

  • raw_udd_[h|r|t], raw user-defined dataset. In this format, users only need to provide triples; the dataloader generates and manages the id mapping. The dataloader generates two files: entities.tsv for the entity id mapping and relations.tsv for the relation id mapping. The order of head, relation and tail in a triple is described by [h|r|t]; for example, raw_udd_trh means the triples are stored in the order of tail, relation and head. It should contain three files (see the example after this list):
    • train stores the triples of the training set. Each triple, e.g., [src_name, rel_name, dst_name], should follow the order specified in [h|r|t].
    • valid stores the triples of the validation set, in the same format and order.
    • test stores the triples of the test set, in the same format and order.
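For instance, with raw_udd_hrt a train file holds one triple of raw names per line (tab-separated in this sketch; all names are made up):

Anna        lives_in      Berlin
Berlin      capital_of    Germany
Anna        works_for     Acme_Corp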

Format 2:

  • udd_[h|r|t], user-defined dataset. In this format, users provide the id mapping for entities and relations themselves. The order of head, relation and tail in a triple is described by [h|r|t]; for example, udd_trh means the triples are stored in the order of tail, relation and head. It should contain five files (see the example after this list):
    • entities stores the mapping between entity names and entity ids
    • relations stores the mapping between relation names and relation ids
    • train stores the triples of the training set. Each triple, e.g., [src_id, rel_id, dst_id], should follow the order specified in [h|r|t].
    • valid stores the triples of the validation set, in the same format and order.
    • test stores the triples of the test set, in the same format and order.
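For instance, with udd_hrt the files could look like this (all contents made up):

entities (one "name id" pair per line):
Anna        0
Berlin      1
Germany     2

relations (one "name id" pair per line):
lives_in    0
capital_of  1

train (triples as ids, here in head, relation, tail order):
0   0   1
1   1   2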

Output formats:

To save the trained embeddings, users have to provide the path with --save_emb when running train.py. The saved embeddings are stored as numpy ndarrays.

  • The node embedding is saved as XXX_YYY_entity.npy.

  • The relation embedding is saved as XXX_YYY_relation.npy.

XXX is the dataset name and YYY is the model name.
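For instance, embeddings saved by the DistMult example below (--save_emb DistMult_FB15k_emb) can be loaded back with NumPy, assuming the files land in that directory:

import numpy as np

entity_emb = np.load("DistMult_FB15k_emb/FB15k_DistMult_entity.npy")
relation_emb = np.load("DistMult_FB15k_emb/FB15k_DistMult_relation.npy")
print(entity_emb.shape, relation_emb.shape)   # (num_entities, dim), (num_relations, dim)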

Command line parameters

Here are some examples of using the training script.

Train KGE models with GPU.

python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
    --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 \
    --batch_size_eval 16 --gpu 0 --valid --test -adv

Train KGE models with mixed CPUs and GPUs.

python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
    --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 \
    --batch_size_eval 16 --gpu 0 --valid --test -adv --mix_cpu_gpu

Train embeddings and evaluate them later.

python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
    --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.1 --max_step 100000 \
    --batch_size_eval 16 --gpu 0 --valid -adv --save_emb DistMult_FB15k_emb

python3 eval.py --model_name DistMult --dataset FB15k --hidden_dim 2000 \
    --gamma 500.0 --batch_size 16 --gpu 0 --model_path DistMult_FB15k_emb/

Train embeddings with multiprocessing. This currently doesn't work with MXNet.

python3 train.py --model DistMult --dataset FB15k --batch_size 1024 \
    --neg_sample_size 256 --hidden_dim 2000 --gamma 500.0 --lr 0.07 --max_step 3000 \
    --batch_size_eval 16 --regularization_coef 0.000001 --valid --test -adv --num_proc 8