DGL - Knowledge Graph Embedding

Note: DGL-KE has moved here. The DGL-KE in this folder is deprecated.

Introduction

DGL-KE is a DGL-based package for efficiently computing node embeddings and relation embeddings of knowledge graphs. The package is adapted from KnowledgeGraphEmbedding. It enables fast and scalable training of knowledge graph embeddings while remaining as extensible as KnowledgeGraphEmbedding. On a single machine, training takes only a few minutes for medium-size knowledge graphs such as FB15k and wn18, and a couple of hours on Freebase, which has hundreds of millions of edges.

DGL-KE includes the following knowledge graph embedding models:

  • TransE (TransE_l1 with L1 distance and TransE_l2 with L2 distance)
  • DistMult
  • ComplEx
  • RESCAL
  • TransR
  • RotatE
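
Each of these models scores a triple (head, relation, tail); higher scores indicate more plausible triples. As a minimal illustration (a sketch, not DGL-KE's internal code), here is how TransE_l2 and DistMult score a triple, assuming the embeddings are 400-dimensional numpy vectors (matching the --hidden_dim 400 used in the examples below):

import numpy as np

def transe_l2_score(h, r, t):
    # TransE: a triple is plausible if head + relation is close to tail.
    return -np.linalg.norm(h + r - t, ord=2)

def distmult_score(h, r, t):
    # DistMult: a bilinear model with a diagonal relation matrix.
    return np.sum(h * r * t)

h, r, t = np.random.randn(3, 400)   # random 400-dim embeddings for illustration
print(transe_l2_score(h, r, t), distmult_score(h, r, t))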

We will add other popular models in the future.

DGL-KE supports multiple training modes:

  • CPU training
  • GPU training
  • Joint CPU & GPU training
  • Multiprocessing training on CPUs

For joint CPU & GPU training, node embeddings are stored on the CPU and mini-batches are trained on the GPU. This mode is designed for training KGE models on large knowledge graphs.

For multiprocessing training, each process trains on mini-batches independently and uses shared memory for communication between processes. This mode is designed for training KGE models on large knowledge graphs with many CPU cores.
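
As a generic sketch of this design (an illustration, not DGL-KE's actual implementation), a PyTorch tensor placed in shared memory can be read and updated by several worker processes without copying:

import torch
import torch.multiprocessing as mp

def worker(rank, emb):
    # Each process updates its own rows of the shared embedding table.
    emb[rank] += 1.0

if __name__ == '__main__':
    emb = torch.zeros(4, 8)
    emb.share_memory_()   # move the tensor's storage into shared memory
    procs = [mp.Process(target=worker, args=(i, emb)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(emb.sum())      # tensor(32.): every worker updated the same storage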

We will support multi-GPU training and distributed training in the near future.

Requirements

The package can run with both PyTorch and MXNet. It requires PyTorch v1.2 or newer, or MXNet 1.5 or newer.

Built-in Datasets

DGL-KE provides five built-in knowledge graphs:

| Dataset   | #nodes   | #edges    | #relations |
|-----------|----------|-----------|------------|
| FB15k     | 14951    | 592213    | 1345       |
| FB15k-237 | 14541    | 310116    | 237        |
| wn18      | 40943    | 151442    | 18         |
| wn18rr    | 40943    | 93003     | 11         |
| Freebase  | 86054151 | 338586276 | 14824      |

Users can specify one of the datasets with --dataset in train.py and eval.py.

Performance

The 1-GPU speed is measured with 8 CPU cores and one Nvidia V100 GPU (an AWS p3.2xlarge instance). The 8-GPU speed is measured with 64 CPU cores and eight Nvidia V100 GPUs (an AWS p3.16xlarge instance).

The speed on FB15k (1 GPU)

| Models    | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|-----------|-----------|-----------|----------|---------|--------|--------|--------|
| MAX_STEPS | 48000     | 32000     | 40000    | 100000  | 32000  | 32000  | 20000  |
| TIME      | 370s      | 270s      | 312s     | 282s    | 2095s  | 1556s  | 1861s  |

The accuracy on FB15k (1 GPU)

| Models    | MR    | MRR   | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 44.18 | 0.675 | 0.551  | 0.774  | 0.861   |
| TransE_l2 | 46.71 | 0.665 | 0.551  | 0.804  | 0.846   |
| DistMult  | 61.04 | 0.725 | 0.625  | 0.837  | 0.883   |
| ComplEx   | 64.59 | 0.785 | 0.718  | 0.835  | 0.889   |
| RESCAL    | 122.3 | 0.669 | 0.598  | 0.711  | 0.793   |
| TransR    | 59.86 | 0.676 | 0.591  | 0.735  | 0.814   |
| RotatE    | 43.66 | 0.728 | 0.632  | 0.801  | 0.874   |
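
MR, MRR and HITS@k are standard rank-based metrics: for each test triple, the true entity is ranked against corrupted candidates, and the resulting ranks are aggregated. A minimal numpy sketch of the aggregation:

import numpy as np

def rank_metrics(ranks, ks=(1, 3, 10)):
    ranks = np.asarray(ranks, dtype=float)
    metrics = {'MR': ranks.mean(),           # mean rank, lower is better
               'MRR': (1.0 / ranks).mean()}  # mean reciprocal rank, higher is better
    for k in ks:
        metrics['HITS@%d' % k] = (ranks <= k).mean()  # fraction of ranks <= k
    return metrics

print(rank_metrics([1, 2, 5, 40]))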

The speed on FB15k (8 GPU)

| Models    | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|-----------|-----------|-----------|----------|---------|--------|--------|--------|
| MAX_STEPS | 6000      | 4000      | 5000     | 4000    | 4000   | 4000   | 2500   |
| TIME      | 88.93s    | 62.99s    | 72.74s   | 68.37s  | 245.9s | 203.9s | 126.7s |

The accuracy on FB15k (8 GPU)

| Models    | MR    | MRR   | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 44.25 | 0.672 | 0.547  | 0.774  | 0.860   |
| TransE_l2 | 46.13 | 0.658 | 0.539  | 0.748  | 0.845   |
| DistMult  | 61.72 | 0.723 | 0.626  | 0.798  | 0.881   |
| ComplEx   | 65.84 | 0.754 | 0.676  | 0.813  | 0.880   |
| RESCAL    | 135.6 | 0.652 | 0.580  | 0.693  | 0.779   |
| TransR    | 65.27 | 0.676 | 0.591  | 0.736  | 0.811   |
| RotatE    | 49.59 | 0.683 | 0.581  | 0.759  | 0.848   |

In comparison, GraphVite takes 14 minutes to train TransE on FB15k with 4 GPUs, while DGL-KE with 8 GPUs finishes in about 89 seconds, roughly 9.5X as fast. More performance information on GraphVite can be found here.

The speed on wn18 (1 GPU)

| Models    | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|-----------|-----------|-----------|----------|---------|--------|--------|--------|
| MAX_STEPS | 32000     | 32000     | 20000    | 20000   | 20000  | 30000  | 24000  |
| TIME      | 531.5s    | 406.6s    | 284.1s   | 282.3s  | 443.6s | 766.2s | 829.4s |

The accuracy on wn18 (1 GPU)

| Models    | MR    | MRR   | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 318.4 | 0.764 | 0.602  | 0.929  | 0.949   |
| TransE_l2 | 206.2 | 0.561 | 0.306  | 0.800  | 0.944   |
| DistMult  | 486.0 | 0.818 | 0.711  | 0.921  | 0.948   |
| ComplEx   | 268.6 | 0.933 | 0.916  | 0.949  | 0.961   |
| RESCAL    | 536.6 | 0.848 | 0.790  | 0.900  | 0.927   |
| TransR    | 452.4 | 0.620 | 0.461  | 0.758  | 0.856   |
| RotatE    | 487.9 | 0.944 | 0.940  | 0.947  | 0.952   |

The speed on wn18 (8 GPU)

| Models    | TransE_l1 | TransE_l2 | DistMult | ComplEx | RESCAL | TransR | RotatE |
|-----------|-----------|-----------|----------|---------|--------|--------|--------|
| MAX_STEPS | 4000      | 4000      | 2500     | 2500    | 2500   | 2500   | 3000   |
| TIME      | 119.3s    | 81.1s     | 76.0s    | 58.0s   | 594.1s | 1168s  | 139.8s |

The accuracy on wn18 (8 GPU)

| Models    | MR    | MRR   | HITS@1 | HITS@3 | HITS@10 |
|-----------|-------|-------|--------|--------|---------|
| TransE_l1 | 360.3 | 0.745 | 0.562  | 0.930  | 0.951   |
| TransE_l2 | 193.8 | 0.557 | 0.301  | 0.799  | 0.942   |
| DistMult  | 499.9 | 0.807 | 0.692  | 0.917  | 0.945   |
| ComplEx   | 476.7 | 0.935 | 0.926  | 0.943  | 0.949   |
| RESCAL    | 618.8 | 0.848 | 0.791  | 0.897  | 0.927   |
| TransR    | 513.1 | 0.659 | 0.491  | 0.821  | 0.871   |
| RotatE    | 466.2 | 0.944 | 0.940  | 0.945  | 0.951   |

The speed on Freebase (8 GPU)

| Models    | TransE_l2 | DistMult | ComplEx | TransR | RotatE |
|-----------|-----------|----------|---------|--------|--------|
| MAX_STEPS | 320000    | 300000   | 360000  | 300000 | 300000 |
| TIME      | 7908s     | 7425s    | 8946s   | 16816s | 12817s |

The accuracy on Freebase (8 GPU; evaluated with 1000 negative edges sampled per positive edge)

| Models    | MR   | MRR   | HITS@1 | HITS@3 | HITS@10 |
|-----------|------|-------|--------|--------|---------|
| TransE_l2 | 22.4 | 0.756 | 0.688  | 0.800  | 0.882   |
| DistMult  | 45.4 | 0.833 | 0.812  | 0.843  | 0.872   |
| ComplEx   | 48.0 | 0.830 | 0.812  | 0.838  | 0.864   |
| TransR    | 51.2 | 0.697 | 0.656  | 0.716  | 0.771   |
| RotatE    | 93.3 | 0.770 | 0.749  | 0.780  | 0.805   |

The speed on Freebase (48 CPU)

This is measured with 48 CPU cores on an AWS r5dn.24xlarge instance.

| Models    | TransE_l2 | DistMult | ComplEx |
|-----------|-----------|----------|---------|
| MAX_STEPS | 50000     | 50000    | 50000   |
| TIME      | 7002s     | 6340s    | 8133s   |

The accuracy on Freebase (48 CPU; evaluated with 1000 negative edges sampled per positive edge)

| Models    | MR   | MRR   | HITS@1 | HITS@3 | HITS@10 |
|-----------|------|-------|--------|--------|---------|
| TransE_l2 | 30.8 | 0.814 | 0.764  | 0.848  | 0.902   |
| DistMult  | 45.1 | 0.834 | 0.815  | 0.843  | 0.871   |
| ComplEx   | 44.9 | 0.837 | 0.819  | 0.845  | 0.870   |

The configuration for reproducing the performance results can be found here.

Usage

DGL-KE doesn't require installation. The package contains two scripts, train.py and eval.py.

  • train.py trains knowledge graph embeddings and outputs the trained node embeddings and relation embeddings.

  • eval.py reads the pre-trained node embeddings and relation embeddings and evaluates how accurately they predict the tail node given (head, rel, ?) and the head node given (?, rel, tail).

Input formats:

DGL-KE supports two knowledge graph input formats for user-defined datasets.

Format 1:

  • raw_udd_[h|r|t], raw user-defined dataset. In this format, users only need to provide the triples; the dataloader generates and manages the ID mapping. The dataloader generates two files: entities.tsv for the entity ID mapping and relations.tsv for the relation ID mapping. The order of head, relation and tail in a triple is described by [h|r|t]; for example, raw_udd_trh means the triples are stored in the order of tail, relation and head. The dataset should contain three files (a small example follows this list):
    • train stores the triples in the training set. Each line is a triple, e.g., [src_name, rel_name, dst_name], following the order specified in [h|r|t].
    • valid stores the triples in the validation set, in the same format as train.
    • test stores the triples in the test set, in the same format as train.
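
For example, a tiny raw_udd_hrt dataset could be written like this (a sketch; tab-separated columns and the directory name my_kg are assumptions for illustration):

import os

triples = [('Paris', 'capital_of', 'France'),
           ('Berlin', 'capital_of', 'Germany')]

os.makedirs('my_kg', exist_ok=True)
for name in ('train', 'valid', 'test'):
    with open(os.path.join('my_kg', name), 'w') as f:
        for h, r, t in triples:
            # one triple per line, in head, relation, tail order (hrt)
            f.write('%s\t%s\t%s\n' % (h, r, t))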

Format 2:

  • udd_[h|r|t], user-defined dataset. In this format, users should also provide the ID mappings for entities and relations. The order of head, relation and tail in a triple is described by [h|r|t]; for example, udd_trh means the triples are stored in the order of tail, relation and head. The dataset should contain five files (a small example follows this list):
    • entities stores the mapping between entity name and entity ID.
    • relations stores the mapping between relation name and relation ID.
    • train stores the triples in the training set. Each line is a triple, e.g., [src_id, rel_id, dst_id], following the order specified in [h|r|t].
    • valid stores the triples in the validation set, in the same format as train.
    • test stores the triples in the test set, in the same format as train.
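
A sketch of writing the five files for a toy udd_hrt dataset (the column order inside the mapping files is an assumption for illustration):

entities = ['Paris', 'France', 'Berlin', 'Germany']
relations = ['capital_of']

with open('entities', 'w') as f:
    for i, name in enumerate(entities):
        f.write('%d\t%s\n' % (i, name))    # entity ID and entity name
with open('relations', 'w') as f:
    for i, name in enumerate(relations):
        f.write('%d\t%s\n' % (i, name))    # relation ID and relation name
for fname in ('train', 'valid', 'test'):
    with open(fname, 'w') as f:
        f.write('0\t0\t1\n')               # (Paris, capital_of, France) as IDs
        f.write('2\t0\t3\n')               # (Berlin, capital_of, Germany) as IDs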

Output formats:

To save the trained embeddings, users have to provide the path with --save_emb when running train.py. The saved embeddings are stored as numpy ndarrays.

  • The node embedding is saved as XXX_YYY_entity.npy.

  • The relation embedding is saved as XXX_YYY_relation.npy.

XXX is the dataset name and YYY is the model name.
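
Once saved, the embeddings can be loaded back with numpy for downstream use, e.g., to rank candidate tails for a (head, rel, ?) query with DistMult embeddings trained on FB15k. This is a sketch: the save path matches the --save_emb example below, and the IDs are hypothetical values looked up in entities.tsv and relations.tsv.

import numpy as np

entity_emb = np.load('DistMult_FB15k_emb/FB15k_DistMult_entity.npy')
rel_emb = np.load('DistMult_FB15k_emb/FB15k_DistMult_relation.npy')

head_id, rel_id = 0, 0   # hypothetical IDs from entities.tsv / relations.tsv
# DistMult score for every candidate tail: sum(h * r * t)
scores = entity_emb @ (entity_emb[head_id] * rel_emb[rel_id])
print(np.argsort(-scores)[:10])   # top-10 highest-scoring tails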

Command line parameters

Here are some examples of using the training script.

Train KGE models with a GPU.

python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 \
    --hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 --valid --test -adv \
    --gpu 0 --max_step 40000

Train KGE models with mixed CPU & GPU training on multiple GPUs.

python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 \
    --hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 --valid --test -adv \
    --max_step 5000 --mix_cpu_gpu --num_proc 8 --gpu 0 1 2 3 4 5 6 7 --async_update \
    --soft_rel_part --force_sync_interval 1000

Train embeddings and evaluate them later.

python3 train.py --model DistMult --dataset FB15k --batch_size 1024 --neg_sample_size 256 \
    --hidden_dim 400 --gamma 143.0 --lr 0.08 --batch_size_eval 16 --valid --test -adv \
    --gpu 0 --max_step 40000 --save_emb DistMult_FB15k_emb

python3 eval.py --model_name DistMult --dataset FB15k --hidden_dim 400 \
    --gamma 143.0 --batch_size 16 --gpu 0 --model_path DistMult_FB15k_emb/

Train embeddings with multiprocessing. This currently doesn't work with MXNet.

python3 train.py --model TransE_l2 --dataset Freebase --batch_size 1000 \
    --neg_sample_size 200 --hidden_dim 400 --gamma 10 --lr 0.1 --max_step 50000 \
    --log_interval 100 --batch_size_eval 1000 --neg_sample_size_eval 1000 --test \
    -adv --regularization_coef 1e-9 --num_thread 1 --num_proc 48