[Distributed] Heterogeneous graph support (dmlc#2457)
* Distributed heterograph (dmlc#3)

* heterogeneous graph partition.

* fix graph partition book for heterograph.

* load heterograph partitions.

* update DistGraphServer to support heterograph.

* make DistGraph runnable for heterograph.

* partition a graph and store parts with homogeneous graph structure.

* update DistGraph server&client to use homogeneous graph.

* shuffle node Ids based on node types.

* load mag in heterograph.

* fix per-node-type mapping.

* balance node types.

* fix for homogeneous graph

* store etype for now.

* fix data name.

* fix a bug in example.

* add profiler in rgcn.

* heterogeneous RGCN.

* map homogeneous node ids to hetero node ids.

* fix graph partition book.

* fix DistGraph.

* shuffle eids.

* verify eids and their mappings when loading a partition.

* Id map from homogeneous Ids to per-type Ids.

* verify partitioned results.

* add test for distributed sampler.

* add mapping from per-type Ids to homogeneous Ids.

* update example.

* fix DistGraph.

* Revert "add profiler in rgcn."

This reverts commit 36daaed8b660933dac8f61a39faec3da2467d676.

* add tests for homogeneous graphs.

* fix a bug.

* fix test.

* fix for one partition.

* fix for standalone training and evaluation.

* small fix.

* fix two bugs.

* initialize projection matrix.

* small fix on RGCN.

* Fix rgcn performance (dmlc#17)

Co-authored-by: Ubuntu <[email protected]>

* fix lint.

* fix lint.

* fix lint.

* fix lint.

* fix lint.

* fix lint.

* fix.

* fix test.

* fix lint.

* test partitions.

* remove redundant test for partitioning.

* remove commented code.

* fix partition.

* fix tests.

* fix RGCN.

* fix test.

* fix test.

* fix test.

* fix.

* fix a bug.

* update dmlc-core.

* fix.

* fix rgcn.

* update readme.

* add comments.

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: xiang song(charlie.song) <[email protected]>
Co-authored-by: Ubuntu <[email protected]>

* fix.

* fix.

* add div_int.

* fix.

* fix.

* fix lint.

* fix.

* fix.

* fix.

* adjust.

* move code.

* handle heterograph.

* return pytorch tensor in GPB.

* remove some tests in example.

* add to_block for distributed training.

* use distributed to_block.

* remove unnecessary function in DistGraph.

* remove distributed to_block.

* use pytorch tensor.

* fix a bug in ntypes and etypes.

* enable norm.

* make the data loader compatible with the old format.

* fix.

* add comments.

* fix a bug.

* add test for heterograph.

* support partition without reshuffle.

* add test.

* support partition without reshuffle.

* fix.

* add test.

* fix bugs.

* fix lint.

* fix dataset.

* fix for mxnet.

* update docstring.

* rename to floor_div.

* avoid exposing NodePartitionPolicy and EdgePartitionPolicy.

* fix docstring.

* fix error.

* fixes.

* fix comments.

* rename.

* rename.

* explain IdMap.

* fix docstring.

* fix docstring.

* update docstring.

* remove the code of returning heterograph.

* remove argument.

* fix example.

* make GraphPartitionBook an abstract class.

* fix.

* fix.

* fix a bug.

* fix a bug in example

* fix a bug

* reverse heterograph sampling.

* temp fix.

* fix lint.

* Revert "temp fix."

This reverts commit c450717.

* compute norm.

* Revert "reverse heterograph sampling."

This reverts commit bd6deb7.

* fix.

* move id_map.py

* remove check

* add more comments.

* update docstring.

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: xiang song(charlie.song) <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
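
Several of the bullets above (shuffling node IDs by type, the Id map from homogeneous Ids to per-type Ids, "explain IdMap") revolve around one bookkeeping idea: after partitioning reshuffles nodes so that each type occupies a contiguous ID range, a homogeneous ID and a per-type ID differ only by a per-type offset. A simplified sketch of that mapping (illustrative only, not DGL's actual IdMap implementation):

```python
import torch as th

class IdMap:
    """Toy ID map: nodes are laid out type by type, so a homogeneous ID
    is the per-type ID plus the offset where that type's range begins."""
    def __init__(self, num_nodes_per_type):
        # e.g. {'paper': 4, 'author': 3} -> paper IDs 0-3, author IDs 4-6
        self.offsets = {}
        offset = 0
        for ntype, num in num_nodes_per_type.items():
            self.offsets[ntype] = offset
            offset += num

    def to_per_type(self, homo_ids, ntype):
        return homo_ids - self.offsets[ntype]

    def to_homogeneous(self, per_type_ids, ntype):
        return per_type_ids + self.offsets[ntype]

id_map = IdMap({'paper': 4, 'author': 3})
print(id_map.to_per_type(th.tensor([4, 5, 6]), 'author'))  # tensor([0, 1, 2])
```
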
5 people authored Jan 25, 2021
1 parent aa884d4 commit 25ac334
Showing 29 changed files with 2,294 additions and 601 deletions.
2 changes: 2 additions & 0 deletions examples/pytorch/graphsage/experimental/README.md
@@ -39,6 +39,8 @@ the number of nodes, the number of edges and the number of labelled nodes.
python3 partition_graph.py --dataset ogb-product --num_parts 4 --balance_train --balance_edges
```

+This script generates the partitioned graphs and stores them in a directory called `data`.

### Step 2: copy the partitioned data and files to the cluster

DGL provides a script for copying partitioned data and files to the cluster. Before that, copy the training script to a local folder:
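
As a quick sanity check on Step 1's output before copying anything, the partitions written under `data` can be loaded back. A minimal sketch, assuming the `dgl.distributed.load_partition` helper (its exact return signature varies across DGL versions, so the result is indexed rather than unpacked):

```python
import dgl

# Load partition 0 back from the config written by partition_graph.py.
parts = dgl.distributed.load_partition('data/ogb-product.json', 0)
part_g = parts[0]  # the partition's graph structure
print(part_g.number_of_nodes(), part_g.number_of_edges())
```
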
8 changes: 5 additions & 3 deletions examples/pytorch/rgcn/experimental/README.md
@@ -1,6 +1,6 @@
## Distributed training

-This is an example of training RGCN node classification in a distributed fashion. Currently, the example only support training RGCN graphs with no input features. The current implementation follows ../rgcn/entity_claasify_mp.py.
+This is an example of training RGCN node classification in a distributed fashion. Currently, the example trains RGCN on graphs with input node features. The current implementation follows ../rgcn/entity_classify_mp.py.

Before training, please install some python libs by pip:

@@ -36,6 +36,8 @@ the number of nodes, the number of edges and the number of labelled nodes.
python3 partition_graph.py --dataset ogbn-mag --num_parts 4 --balance_train --balance_edges
```

+This script generates the partitioned graphs and stores them in a directory called `data`.

### Step 2: copy the partitioned data to the cluster
DGL provides a script for copying partitioned data to the cluster. Before that, copy the training script to a local folder:

@@ -78,7 +80,7 @@ python3 ~/dgl/tools/launch.py \
--num_samplers 4 \
--part_config data/ogbn-mag.json \
--ip_config ip_config.txt \
"python3 dgl_code/entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 16 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --num-workers 4 --num-servers 1 --sparse-embedding --sparse-lr 0.06"
"python3 dgl_code/entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 16 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --num-workers 4 --num-servers 1 --sparse-embedding --sparse-lr 0.06 --node-feats"
```

We can get the performance score at the second epoch:
@@ -98,5 +100,5 @@ python3 partition_graph.py --dataset ogbn-mag --num_parts 1

### Step 2: run the training script
```bash
-python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 256 --n-hidden 64 --lr 0.01 --eval-batch-size 8 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --conf-path 'data/ogbn-mag.json' --standalone
+python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 128 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --conf-path 'data/ogbn-mag.json' --standalone --sparse-embedding --sparse-lr 0.06 --node-feats
```
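
Both the distributed and standalone commands read `data/ogbn-mag.json`, the partition config written in Step 1. A hedged sketch of inspecting it (only the `graph_name` and `num_parts` fields are assumed here; other fields vary by DGL version):

```python
import json

with open('data/ogbn-mag.json') as f:
    conf = json.load(f)
# The config records graph-level metadata plus one entry per partition
# (e.g. where each part's graph structure and features are stored).
print(conf['graph_name'], conf['num_parts'])
```
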
206 changes: 123 additions & 83 deletions examples/pytorch/rgcn/experimental/entity_classify_dist.py

Large diffs are not rendered by default.

60 changes: 14 additions & 46 deletions examples/pytorch/rgcn/experimental/partition_graph.py
@@ -6,7 +6,7 @@

from ogb.nodeproppred import DglNodePropPredDataset

-def load_ogb(dataset, global_norm):
+def load_ogb(dataset):
    if dataset == 'ogbn-mag':
        dataset = DglNodePropPredDataset(name=dataset)
        split_idx = dataset.get_idx_split()
@@ -33,54 +33,24 @@ def load_ogb(dataset, global_norm):
        print('Number of valid: {}'.format(len(val_idx)))
        print('Number of test: {}'.format(len(test_idx)))

-        # currently we do not support node feature in mag dataset.
-        # calculate norm for each edge type and store in edge
-        if global_norm is False:
-            for canonical_etype in hg.canonical_etypes:
-                u, v, eid = hg.all_edges(form='all', etype=canonical_etype)
-                _, inverse_index, count = th.unique(v, return_inverse=True, return_counts=True)
-                degrees = count[inverse_index]
-                norm = th.ones(eid.shape[0]) / degrees
-                norm = norm.unsqueeze(1)
-                hg.edges[canonical_etype].data['norm'] = norm
-
-        # get target category id
-        category_id = len(hg.ntypes)
-        for i, ntype in enumerate(hg.ntypes):
-            if ntype == category:
-                category_id = i
-
-        g = dgl.to_homogeneous(hg, edata=['norm'])
-        if global_norm:
-            u, v, eid = g.all_edges(form='all')
-            _, inverse_index, count = th.unique(v, return_inverse=True, return_counts=True)
-            degrees = count[inverse_index]
-            norm = th.ones(eid.shape[0]) / degrees
-            norm = norm.unsqueeze(1)
-            g.edata['norm'] = norm
-
-        node_ids = th.arange(g.number_of_nodes())
-        # find out the target node ids
-        node_tids = g.ndata[dgl.NTYPE]
-        loc = (node_tids == category_id)
-        target_idx = node_ids[loc]
-        train_idx = target_idx[train_idx]
-        val_idx = target_idx[val_idx]
-        test_idx = target_idx[test_idx]
-        train_mask = th.zeros((g.number_of_nodes(),), dtype=th.bool)
+        train_mask = th.zeros((hg.number_of_nodes('paper'),), dtype=th.bool)
        train_mask[train_idx] = True
-        val_mask = th.zeros((g.number_of_nodes(),), dtype=th.bool)
+        val_mask = th.zeros((hg.number_of_nodes('paper'),), dtype=th.bool)
        val_mask[val_idx] = True
-        test_mask = th.zeros((g.number_of_nodes(),), dtype=th.bool)
+        test_mask = th.zeros((hg.number_of_nodes('paper'),), dtype=th.bool)
        test_mask[test_idx] = True
-        g.ndata['train_mask'] = train_mask
-        g.ndata['val_mask'] = val_mask
-        g.ndata['test_mask'] = test_mask
+        hg.nodes['paper'].data['train_mask'] = train_mask
+        hg.nodes['paper'].data['val_mask'] = val_mask
+        hg.nodes['paper'].data['test_mask'] = test_mask

-        labels = th.full((g.number_of_nodes(),), -1, dtype=paper_labels.dtype)
-        labels[target_idx] = paper_labels
-        g.ndata['labels'] = labels
-        return g
+        hg.nodes['paper'].data['labels'] = paper_labels
+        return hg
    else:
raise("Do not support other ogbn datasets.")

Expand All @@ -98,21 +68,19 @@ def load_ogb(dataset, global_norm):
help='turn the graph into an undirected graph.')
argparser.add_argument('--balance_edges', action='store_true',
help='balance the number of edges in each partition.')
argparser.add_argument('--global-norm', default=False, action='store_true',
help='User global norm instead of per node type norm')
args = argparser.parse_args()

start = time.time()
g = load_ogb(args.dataset, args.global_norm)
g = load_ogb(args.dataset)

print('load {} takes {:.3f} seconds'.format(args.dataset, time.time() - start))
print('|V|={}, |E|={}'.format(g.number_of_nodes(), g.number_of_edges()))
print('train: {}, valid: {}, test: {}'.format(th.sum(g.ndata['train_mask']),
th.sum(g.ndata['val_mask']),
th.sum(g.ndata['test_mask'])))
print('train: {}, valid: {}, test: {}'.format(th.sum(g.nodes['paper'].data['train_mask']),
th.sum(g.nodes['paper'].data['val_mask']),
th.sum(g.nodes['paper'].data['test_mask'])))

if args.balance_train:
balance_ntypes = g.ndata['train_mask']
balance_ntypes = {'paper': g.nodes['paper'].data['train_mask']}
else:
balance_ntypes = None

Expand Down
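
The hunk ends just before the script's actual partitioning call. A hedged sketch of that continuation (argument names follow the DGL 0.5-era `dgl.distributed.partition_graph` signature; `args.num_parts` comes from an argparser option not shown in this hunk):

```python
from dgl.distributed import partition_graph

# Illustrative continuation of the script above, not part of the diff.
partition_graph(g, args.dataset, args.num_parts, 'data',
                part_method='metis',
                balance_ntypes=balance_ntypes,
                balance_edges=args.balance_edges)
```
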
16 changes: 16 additions & 0 deletions python/dgl/backend/backend.py
@@ -355,6 +355,22 @@ def sum(input, dim, keepdims=False):
    """
    pass

+def floor_div(in1, in2):
+    """Element-wise integer division, rounding each quotient towards zero.
+
+    Parameters
+    ----------
+    in1 : Tensor
+        The first input tensor.
+    in2 : Tensor or int
+        The second input, a tensor or an integer divisor.
+
+    Returns
+    -------
+    Tensor
+        A framework-specific tensor.
+    """
+    pass

def reduce_sum(input):
    """Returns the sum of all elements in the input tensor.
3 changes: 3 additions & 0 deletions python/dgl/backend/mxnet/tensor.py
@@ -149,6 +149,9 @@ def sum(input, dim, keepdims=False):
        return nd.array([0.], dtype=input.dtype, ctx=input.context)
    return nd.sum(input, axis=dim, keepdims=keepdims)

+def floor_div(in1, in2):
+    return in1 / in2

def reduce_sum(input):
    return input.sum()

3 changes: 3 additions & 0 deletions python/dgl/backend/pytorch/tensor.py
@@ -117,6 +117,9 @@ def copy_to(input, ctx, **kwargs):
def sum(input, dim, keepdims=False):
    return th.sum(input, dim=dim, keepdim=keepdims)

+def floor_div(in1, in2):
+    return in1 // in2

def reduce_sum(input):
    return input.sum()

2 changes: 2 additions & 0 deletions python/dgl/backend/tensorflow/tensor.py
@@ -168,6 +168,8 @@ def sum(input, dim, keepdims=False):
        input = tf.cast(input, tf.int32)
    return tf.reduce_sum(input, axis=dim, keepdims=keepdims)

+def floor_div(in1, in2):
+    return astype(in1 / in2, dtype(in1))

def reduce_sum(input):
    if input.dtype == tf.bool:
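
Across the three backends, `floor_div` amounts to integer division truncated towards zero for the dtypes it is applied to. A quick illustrative check with the PyTorch backend (note that `//` in recent PyTorch versions performs floor rather than truncating division, which only differs for negative operands; the node and edge IDs this helper is used on are non-negative):

```python
import torch as th

# floor_div maps shuffled homogeneous IDs into per-type/per-partition
# ranges, so operands are non-negative in practice.
ids = th.tensor([0, 5, 11, 12])
print(ids // 4)  # tensor([0, 1, 2, 3])
```
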
6 changes: 3 additions & 3 deletions python/dgl/data/citation_graph.py
@@ -184,9 +184,9 @@ def load(self):
        self._graph = nx.DiGraph(graph)

        self._num_classes = info['num_classes']
-        self._g.ndata['train_mask'] = generate_mask_tensor(self._g.ndata['train_mask'].numpy())
-        self._g.ndata['val_mask'] = generate_mask_tensor(self._g.ndata['val_mask'].numpy())
-        self._g.ndata['test_mask'] = generate_mask_tensor(self._g.ndata['test_mask'].numpy())
+        self._g.ndata['train_mask'] = generate_mask_tensor(F.asnumpy(self._g.ndata['train_mask']))
+        self._g.ndata['val_mask'] = generate_mask_tensor(F.asnumpy(self._g.ndata['val_mask']))
+        self._g.ndata['test_mask'] = generate_mask_tensor(F.asnumpy(self._g.ndata['test_mask']))
        # hack for mxnet compatibility

        if self.verbose:
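
The switch from `.numpy()` to `F.asnumpy` is what makes this loader work under MXNet as well as PyTorch. A minimal sketch of the dispatch idea (simplified; not DGL's actual backend implementation):

```python
import numpy as np

def asnumpy(tensor):
    # Simplified stand-in for dgl.backend.asnumpy: dispatch on the
    # framework's conversion method instead of assuming .numpy().
    if hasattr(tensor, 'asnumpy'):  # MXNet NDArray
        return tensor.asnumpy()
    if hasattr(tensor, 'numpy'):    # PyTorch / TensorFlow eager tensor
        return np.asarray(tensor.numpy())
    return np.asarray(tensor)
```
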
2 changes: 1 addition & 1 deletion python/dgl/distributed/dist_dataloader.py
@@ -133,7 +133,7 @@ def __init__(self, dataset, batch_size, shuffle=False, collate_fn=None, drop_las
        if not self.drop_last and len(dataset) % self.batch_size != 0:
            self.expected_idxs += 1

-        # We need to have a unique Id for each data loader to identify itself
+        # We need to have a unique ID for each data loader to identify itself
        # in the sampler processes.
        global DATALOADER_ID
        self.name = "dataloader-" + str(DATALOADER_ID)