[Distributed] Heterogeneous graph support (dmlc#2457)
* Distributed heterograph (dmlc#3)

* heterogeneous graph partition.

* fix graph partition book for heterograph.

* load heterograph partitions.

* update DistGraphServer to support heterograph.

* make DistGraph runnable for heterograph.

* partition a graph and store parts with homogeneous graph structure.

* update DistGraph server&client to use homogeneous graph.

* shuffle node Ids based on node types.

* load mag in heterograph.

* fix per-node-type mapping.

* balance node types.

* fix for homogeneous graph

* store etype for now.

* fix data name.

* fix a bug in example.

* add profiler in rgcn.

* heterogeneous RGCN.

* map homogeneous node ids to hetero node ids.

* fix graph partition book.

* fix DistGraph.

* shuffle eids.

* verify eids and their mappings when loading a partition.

* Id map from homogeneous Ids to per-type Ids.

* verify partitioned results.

* add test for distributed sampler.

* add mapping from per-type Ids to homogeneous Ids.

* update example.

* fix DistGraph.

* Revert "add profiler in rgcn."

This reverts commit 36daaed8b660933dac8f61a39faec3da2467d676.

* add tests for homogeneous graphs.

* fix a bug.

* fix test.

* fix for one partition.

* fix for standalone training and evaluation.

* small fix.

* fix two bugs.

* initialize projection matrix.

* small fix on RGCN.

* Fix rgcn performance (dmlc#17)

Co-authored-by: Ubuntu <[email protected]>

* fix lint.

* fix lint.

* fix lint.

* fix lint.

* fix lint.

* fix lint.

* fix.

* fix test.

* fix lint.

* test partitions.

* remove redundant test for partitioning.

* remove commented code.

* fix partition.

* fix tests.

* fix RGCN.

* fix test.

* fix test.

* fix test.

* fix.

* fix a bug.

* update dmlc-core.

* fix.

* fix rgcn.

* update readme.

* add comments.

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: xiang song(charlie.song) <[email protected]>
Co-authored-by: Ubuntu <[email protected]>

* fix.

* fix.

* add div_int.

* fix.

* fix.

* fix lint.

* fix.

* fix.

* fix.

* adjust.

* move code.

* handle heterograph.

* return pytorch tensor in GPB.

* remove some tests in example.

* add to_block for distributed training.

* use distributed to_block.

* remove unnecessary function in DistGraph.

* remove distributed to_block.

* use pytorch tensor.

* fix a bug in ntypes and etypes.

* enable norm.

* make the data loader compatible with the old format.

* fix.

* add comments.

* fix a bug.

* add test for heterograph.

* support partition without reshuffle.

* add test.

* support partition without reshuffle.

* fix.

* add test.

* fix bugs.

* fix lint.

* fix dataset.

* fix for mxnet.

* update docstring.

* rename to floor_div.

* avoid exposing NodePartitionPolicy and EdgePartitionPolicy.

* fix docstring.

* fix error.

* fixes.

* fix comments.

* rename.

* rename.

* explain IdMap.

* fix docstring.

* fix docstring.

* update docstring.

* remove the code of returning heterograph.

* remove argument.

* fix example.

* make GraphPartitionBook an abstract class.

* fix.

* fix.

* fix a bug.

* fix a bug in example

* fix a bug

* reverse heterograph sampling.

* temp fix.

* fix lint.

* Revert "temp fix."

This reverts commit c450717.

* compute norm.

* Revert "reverse heterograph sampling."

This reverts commit bd6deb7.

* fix.

* move id_map.py

* remove check

* add more comments.

* update docstring.

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: xiang song(charlie.song) <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
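
Several of the bullets above (shuffling node IDs by type, the Id map from homogeneous Ids to per-type Ids, "explain IdMap") revolve around one bookkeeping idea: after partitioning reshuffles nodes so that each type occupies a contiguous ID range, a homogeneous ID and a per-type ID differ only by a per-type offset. A simplified sketch of that mapping (illustrative only, not DGL's actual IdMap implementation):

```python
import torch as th

class IdMap:
    """Toy ID map: nodes are laid out type by type, so a homogeneous ID
    is the per-type ID plus the offset where that type's range begins."""
    def __init__(self, num_nodes_per_type):
        # e.g. {'paper': 4, 'author': 3} -> paper IDs 0-3, author IDs 4-6
        self.offsets = {}
        offset = 0
        for ntype, num in num_nodes_per_type.items():
            self.offsets[ntype] = offset
            offset += num

    def to_per_type(self, homo_ids, ntype):
        return homo_ids - self.offsets[ntype]

    def to_homogeneous(self, per_type_ids, ntype):
        return per_type_ids + self.offsets[ntype]

id_map = IdMap({'paper': 4, 'author': 3})
print(id_map.to_per_type(th.tensor([4, 5, 6]), 'author'))  # tensor([0, 1, 2])
```
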
5 people authored Jan 25, 2021
1 parent aa884d4 commit 25ac334
Showing 29 changed files with 2,294 additions and 601 deletions.
2 changes: 2 additions & 0 deletions examples/pytorch/graphsage/experimental/README.md
@@ -39,6 +39,8 @@ the number of nodes, the number of edges and the number of labelled nodes.
python3 partition_graph.py --dataset ogb-product --num_parts 4 --balance_train --balance_edges
```

+This script generates the partitioned graphs and stores them in a directory called `data`.

### Step 2: copy the partitioned data and files to the cluster

DGL provides a script for copying partitioned data and files to the cluster. Before that, copy the training script to a local folder:
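
As a quick sanity check on Step 1's output before copying anything, the partitions written under `data` can be loaded back. A minimal sketch, assuming the `dgl.distributed.load_partition` helper (its exact return signature varies across DGL versions, so the result is indexed rather than unpacked):

```python
import dgl

# Load partition 0 back from the config written by partition_graph.py.
parts = dgl.distributed.load_partition('data/ogb-product.json', 0)
part_g = parts[0]  # the partition's graph structure
print(part_g.number_of_nodes(), part_g.number_of_edges())
```
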
8 changes: 5 additions & 3 deletions examples/pytorch/rgcn/experimental/README.md
@@ -1,6 +1,6 @@
## Distributed training

-This is an example of training RGCN node classification in a distributed fashion. Currently, the example only support training RGCN graphs with no input features. The current implementation follows ../rgcn/entity_claasify_mp.py.
+This is an example of training RGCN node classification in a distributed fashion. Currently, the example trains RGCN on graphs with input node features. The current implementation follows ../rgcn/entity_classify_mp.py.

Before training, please install some python libs by pip:

@@ -36,6 +36,8 @@ the number of nodes, the number of edges and the number of labelled nodes.
python3 partition_graph.py --dataset ogbn-mag --num_parts 4 --balance_train --balance_edges
```

+This script generates the partitioned graphs and stores them in a directory called `data`.

### Step 2: copy the partitioned data to the cluster
DGL provides a script for copying partitioned data to the cluster. Before that, copy the training script to a local folder:

@@ -78,7 +80,7 @@ python3 ~/dgl/tools/launch.py \
--num_samplers 4 \
--part_config data/ogbn-mag.json \
--ip_config ip_config.txt \
"python3 dgl_code/entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 16 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --num-workers 4 --num-servers 1 --sparse-embedding --sparse-lr 0.06"
"python3 dgl_code/entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 16 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --num-workers 4 --num-servers 1 --sparse-embedding --sparse-lr 0.06 --node-feats"
```

We can get the performance score at the second epoch:
@@ -98,5 +100,5 @@ python3 partition_graph.py --dataset ogbn-mag --num_parts 1

### Step 2: run the training script
```bash
-python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 256 --n-hidden 64 --lr 0.01 --eval-batch-size 8 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --conf-path 'data/ogbn-mag.json' --standalone
+python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 128 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --conf-path 'data/ogbn-mag.json' --standalone --sparse-embedding --sparse-lr 0.06 --node-feats
```
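
Both the distributed and standalone commands read `data/ogbn-mag.json`, the partition config written in Step 1. A hedged sketch of inspecting it (only the `graph_name` and `num_parts` fields are assumed here; other fields vary by DGL version):

```python
import json

with open('data/ogbn-mag.json') as f:
    conf = json.load(f)
# The config records graph-level metadata plus one entry per partition
# (e.g. where each part's graph structure and features are stored).
print(conf['graph_name'], conf['num_parts'])
```
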
206 changes: 123 additions & 83 deletions examples/pytorch/rgcn/experimental/entity_classify_dist.py

Large diffs are not rendered by default.

60 changes: 14 additions & 46 deletions examples/pytorch/rgcn/experimental/partition_graph.py
@@ -6,7 +6,7 @@

from ogb.nodeproppred import DglNodePropPredDataset

-def load_ogb(dataset, global_norm):
+def load_ogb(dataset):
    if dataset == 'ogbn-mag':
        dataset = DglNodePropPredDataset(name=dataset)
        split_idx = dataset.get_idx_split()
@@ -33,54 +33,24 @@ def load_ogb(dataset, global_norm):
        print('Number of valid: {}'.format(len(val_idx)))
        print('Number of test: {}'.format(len(test_idx)))

-        # currently we do not support node feature in mag dataset.
-        # calculate norm for each edge type and store in edge
-        if global_norm is False:
-            for canonical_etype in hg.canonical_etypes:
-                u, v, eid = hg.all_edges(form='all', etype=canonical_etype)
-                _, inverse_index, count = th.unique(v, return_inverse=True, return_counts=True)
-                degrees = count[inverse_index]
-                norm = th.ones(eid.shape[0]) / degrees
-                norm = norm.unsqueeze(1)
-                hg.edges[canonical_etype].data['norm'] = norm
-
-        # get target category id
-        category_id = len(hg.ntypes)
-        for i, ntype in enumerate(hg.ntypes):
-            if ntype == category:
-                category_id = i
-
-        g = dgl.to_homogeneous(hg, edata=['norm'])
-        if global_norm:
-            u, v, eid = g.all_edges(form='all')
-            _, inverse_index, count = th.unique(v, return_inverse=True, return_counts=True)
-            degrees = count[inverse_index]
-            norm = th.ones(eid.shape[0]) / degrees
-            norm = norm.unsqueeze(1)
-            g.edata['norm'] = norm
-
-        node_ids = th.arange(g.number_of_nodes())
-        # find out the target node ids
-        node_tids = g.ndata[dgl.NTYPE]
-        loc = (node_tids == category_id)
-        target_idx = node_ids[loc]
-        train_idx = target_idx[train_idx]
-        val_idx = target_idx[val_idx]
-        test_idx = target_idx[test_idx]
-        train_mask = th.zeros((g.number_of_nodes(),), dtype=th.bool)
+        train_mask = th.zeros((hg.number_of_nodes('paper'),), dtype=th.bool)
        train_mask[train_idx] = True
-        val_mask = th.zeros((g.number_of_nodes(),), dtype=th.bool)
+        val_mask = th.zeros((hg.number_of_nodes('paper'),), dtype=th.bool)
        val_mask[val_idx] = True
-        test_mask = th.zeros((g.number_of_nodes(),), dtype=th.bool)
+        test_mask = th.zeros((hg.number_of_nodes('paper'),), dtype=th.bool)
        test_mask[test_idx] = True
-        g.ndata['train_mask'] = train_mask
-        g.ndata['val_mask'] = val_mask
-        g.ndata['test_mask'] = test_mask
+        hg.nodes['paper'].data['train_mask'] = train_mask
+        hg.nodes['paper'].data['val_mask'] = val_mask
+        hg.nodes['paper'].data['test_mask'] = test_mask

-        labels = th.full((g.number_of_nodes(),), -1, dtype=paper_labels.dtype)
-        labels[target_idx] = paper_labels
-        g.ndata['labels'] = labels
-        return g
+        hg.nodes['paper'].data['labels'] = paper_labels
+        return hg
    else:
raise("Do not support other ogbn datasets.")

Expand All @@ -98,21 +68,19 @@ def load_ogb(dataset, global_norm):
help='turn the graph into an undirected graph.')
argparser.add_argument('--balance_edges', action='store_true',
help='balance the number of edges in each partition.')
argparser.add_argument('--global-norm', default=False, action='store_true',
help='User global norm instead of per node type norm')
args = argparser.parse_args()

start = time.time()
g = load_ogb(args.dataset, args.global_norm)
g = load_ogb(args.dataset)

print('load {} takes {:.3f} seconds'.format(args.dataset, time.time() - start))
print('|V|={}, |E|={}'.format(g.number_of_nodes(), g.number_of_edges()))
print('train: {}, valid: {}, test: {}'.format(th.sum(g.ndata['train_mask']),
th.sum(g.ndata['val_mask']),
th.sum(g.ndata['test_mask'])))
print('train: {}, valid: {}, test: {}'.format(th.sum(g.nodes['paper'].data['train_mask']),
th.sum(g.nodes['paper'].data['val_mask']),
th.sum(g.nodes['paper'].data['test_mask'])))

if args.balance_train:
balance_ntypes = g.ndata['train_mask']
balance_ntypes = {'paper': g.nodes['paper'].data['train_mask']}
else:
balance_ntypes = None

Expand Down
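
The hunk ends just before the script's actual partitioning call. A hedged sketch of that continuation (argument names follow the DGL 0.5-era `dgl.distributed.partition_graph` signature; `args.num_parts` comes from an argparser option not shown in this hunk):

```python
from dgl.distributed import partition_graph

# Illustrative continuation of the script above, not part of the diff.
partition_graph(g, args.dataset, args.num_parts, 'data',
                part_method='metis',
                balance_ntypes=balance_ntypes,
                balance_edges=args.balance_edges)
```
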
16 changes: 16 additions & 0 deletions python/dgl/backend/backend.py
@@ -355,6 +355,22 @@ def sum(input, dim, keepdims=False):
    """
    pass

+def floor_div(in1, in2):
+    """Element-wise integer division, rounding each quotient towards zero.
+
+    Parameters
+    ----------
+    in1 : Tensor
+        The first input tensor.
+    in2 : Tensor or int
+        The second input, a tensor or an integer divisor.
+
+    Returns
+    -------
+    Tensor
+        A framework-specific tensor.
+    """
+    pass

def reduce_sum(input):
    """Returns the sum of all elements in the input tensor.
3 changes: 3 additions & 0 deletions python/dgl/backend/mxnet/tensor.py
@@ -149,6 +149,9 @@ def sum(input, dim, keepdims=False):
        return nd.array([0.], dtype=input.dtype, ctx=input.context)
    return nd.sum(input, axis=dim, keepdims=keepdims)

+def floor_div(in1, in2):
+    return in1 / in2

def reduce_sum(input):
    return input.sum()

3 changes: 3 additions & 0 deletions python/dgl/backend/pytorch/tensor.py
@@ -117,6 +117,9 @@ def copy_to(input, ctx, **kwargs):
def sum(input, dim, keepdims=False):
    return th.sum(input, dim=dim, keepdim=keepdims)

+def floor_div(in1, in2):
+    return in1 // in2

def reduce_sum(input):
    return input.sum()

2 changes: 2 additions & 0 deletions python/dgl/backend/tensorflow/tensor.py
@@ -168,6 +168,8 @@ def sum(input, dim, keepdims=False):
        input = tf.cast(input, tf.int32)
    return tf.reduce_sum(input, axis=dim, keepdims=keepdims)

+def floor_div(in1, in2):
+    return astype(in1 / in2, dtype(in1))

def reduce_sum(input):
    if input.dtype == tf.bool:
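
Across the three backends, `floor_div` amounts to integer division truncated towards zero for the dtypes it is applied to. A quick illustrative check with the PyTorch backend (note that `//` in recent PyTorch versions performs floor rather than truncating division, which only differs for negative operands; the node and edge IDs this helper is used on are non-negative):

```python
import torch as th

# floor_div maps shuffled homogeneous IDs into per-type/per-partition
# ranges, so operands are non-negative in practice.
ids = th.tensor([0, 5, 11, 12])
print(ids // 4)  # tensor([0, 1, 2, 3])
```
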
6 changes: 3 additions & 3 deletions python/dgl/data/citation_graph.py
@@ -184,9 +184,9 @@ def load(self):
        self._graph = nx.DiGraph(graph)

        self._num_classes = info['num_classes']
-        self._g.ndata['train_mask'] = generate_mask_tensor(self._g.ndata['train_mask'].numpy())
-        self._g.ndata['val_mask'] = generate_mask_tensor(self._g.ndata['val_mask'].numpy())
-        self._g.ndata['test_mask'] = generate_mask_tensor(self._g.ndata['test_mask'].numpy())
+        self._g.ndata['train_mask'] = generate_mask_tensor(F.asnumpy(self._g.ndata['train_mask']))
+        self._g.ndata['val_mask'] = generate_mask_tensor(F.asnumpy(self._g.ndata['val_mask']))
+        self._g.ndata['test_mask'] = generate_mask_tensor(F.asnumpy(self._g.ndata['test_mask']))
        # hack for mxnet compatibility

        if self.verbose:
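
The switch from `.numpy()` to `F.asnumpy` is what makes this loader work under MXNet as well as PyTorch. A minimal sketch of the dispatch idea (simplified; not DGL's actual backend implementation):

```python
import numpy as np

def asnumpy(tensor):
    # Simplified stand-in for dgl.backend.asnumpy: dispatch on the
    # framework's conversion method instead of assuming .numpy().
    if hasattr(tensor, 'asnumpy'):  # MXNet NDArray
        return tensor.asnumpy()
    if hasattr(tensor, 'numpy'):    # PyTorch / TensorFlow eager tensor
        return np.asarray(tensor.numpy())
    return np.asarray(tensor)
```
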
2 changes: 1 addition & 1 deletion python/dgl/distributed/dist_dataloader.py
@@ -133,7 +133,7 @@ def __init__(self, dataset, batch_size, shuffle=False, collate_fn=None, drop_las
        if not self.drop_last and len(dataset) % self.batch_size != 0:
            self.expected_idxs += 1

-        # We need to have a unique Id for each data loader to identify itself
+        # We need to have a unique ID for each data loader to identify itself
        # in the sampler processes.
        global DATALOADER_ID
        self.name = "dataloader-" + str(DATALOADER_ID)