[Distributed] Deprecate old DistEmbedding impl, use synchronized embedding impl (dmlc#3111)

* fix.

* fix.

* fix.

* fix.

* Fix test

* Deprecate old DistEmbedding impl, use synchronized embedding impl

* update doc

Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Da Zheng <[email protected]>
Co-authored-by: Jinjing Zhou <[email protected]>
5 people authored Jul 13, 2021
1 parent ee6bc95 commit d739076
Showing 18 changed files with 62 additions and 251 deletions.
4 changes: 2 additions & 2 deletions docs/source/api/python/dgl.distributed.rst
@@ -27,9 +27,9 @@ Distributed Tensor

Distributed Node Embedding
--------------------------
.. currentmodule:: dgl.distributed.nn.pytorch
.. currentmodule:: dgl.distributed

.. autoclass:: NodeEmbedding
.. autoclass:: DistEmbedding


Distributed embedding optimizer
12 changes: 6 additions & 6 deletions docs/source/guide/distributed-apis.rst
@@ -9,7 +9,7 @@ This section covers the distributed APIs used in the training script. DGL provides
data structures and various APIs for initialization, distributed sampling and workload split.
For distributed training/inference, DGL provides three distributed data structures:
:class:`~dgl.distributed.DistGraph` for distributed graphs, :class:`~dgl.distributed.DistTensor` for
distributed tensors and :class:`~dgl.distributed.nn.NodeEmbedding` for distributed learnable embeddings.
distributed tensors and :class:`~dgl.distributed.DistEmbedding` for distributed learnable embeddings.

Initialization of the DGL distributed module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -27,7 +27,7 @@ Typically, the initialization APIs should be invoked in the following order:
th.distributed.init_process_group(backend='gloo')
**Note**: If the training script contains user-defined functions (UDFs) that have to be invoked on
the servers (see the section of DistTensor and NodeEmbedding for more details), these UDFs have to
the servers (see the section of DistTensor and DistEmbedding for more details), these UDFs have to
be declared before :func:`~dgl.distributed.initialize`.
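
As a quick illustration of this ordering, here is a minimal sketch of a trainer script skeleton; the UDF name is a hypothetical placeholder, not part of the guide:

```python
import dgl
import torch as th

# Any UDF that must run on the servers (e.g. a custom sampling function)
# has to be declared before dgl.distributed.initialize() is called.
def my_sample_fn(seeds):  # hypothetical placeholder UDF
    pass

if __name__ == '__main__':
    # 1. Initialize DGL's distributed module from the cluster configuration.
    dgl.distributed.initialize('ip_config.txt')
    # 2. Then initialize PyTorch's distributed package for the trainers.
    th.distributed.init_process_group(backend='gloo')
```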

Distributed graph
@@ -153,10 +153,10 @@ computation operators, such as sum and mean.
when a machine runs multiple servers. This may result in data corruption. One way to avoid concurrent
writes to the same row of data is to run one server process on a machine.

Distributed NodeEmbedding
Distributed DistEmbedding
~~~~~~~~~~~~~~~~~~~~~~~~~

DGL provides :class:`~dgl.distributed.nn.NodeEmbedding` to support transductive models that require
DGL provides :class:`~dgl.distributed.DistEmbedding` to support transductive models that require
node embeddings. Creating distributed embeddings is very similar to creating distributed tensors.

.. code:: python
@@ -165,7 +165,7 @@
arr = th.zeros(shape, dtype=dtype)
arr.uniform_(-1, 1)
return arr
emb = dgl.distributed.nn.NodeEmbedding(g.number_of_nodes(), 10, init_func=initializer)
emb = dgl.distributed.DistEmbedding(g.number_of_nodes(), 10, init_func=initializer)
Internally, distributed embeddings are built on top of distributed tensors and thus have
very similar behaviors to distributed tensors. For example, when embeddings are created, they
@@ -192,7 +192,7 @@ the other for dense model parameters, as shown in the code below:
optimizer.step()
sparse_optimizer.step()
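Put together, the two-optimizer pattern described above might look like the following sketch; `g`, `model`, `dataloader`, `labels`, and `loss_fn` are assumed to be defined elsewhere in the training script, and the mini-batch unpacking is illustrative:

```python
import torch as th
import dgl

# Distributed learnable embeddings, updated by DGL's sparse optimizer.
emb = dgl.distributed.DistEmbedding(g.number_of_nodes(), 10, name='node_emb')
sparse_optimizer = dgl.distributed.optim.SparseAdagrad([emb], lr=0.06)
# Dense model parameters, updated by a regular PyTorch optimizer.
optimizer = th.optim.Adam(model.parameters(), lr=0.01)

for input_nodes, seeds, blocks in dataloader:
    feats = emb(input_nodes)        # look up embeddings of the input nodes
    loss = loss_fn(model(blocks, feats), labels[seeds])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                # update the dense parameters
    sparse_optimizer.step()         # push updates to the distributed embeddings
```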
**Note**: :class:`~dgl.distributed.nn.NodeEmbedding` is not a PyTorch nn module, so we cannot
**Note**: :class:`~dgl.distributed.DistEmbedding` is not a PyTorch nn module, so we cannot
get access to it from the parameters of a PyTorch nn module.

Distributed sampling
2 changes: 1 addition & 1 deletion docs/source/guide/distributed.rst
@@ -85,7 +85,7 @@ Specifically, DGL's distributed training has three types of interacting processes
generate mini-batches for training.
* Trainers contain multiple classes to interact with servers. A trainer has
:class:`~dgl.distributed.DistGraph` to get access to partitioned graph data and has
:class:`~dgl.distributed.nn.NodeEmbedding` and :class:`~dgl.distributed.DistTensor` to access
:class:`~dgl.distributed.DistEmbedding` and :class:`~dgl.distributed.DistTensor` to access
the node/edge features/embeddings. It has
:class:`~dgl.distributed.dist_dataloader.DistDataLoader` to
interact with samplers to get mini-batches.
10 changes: 5 additions & 5 deletions docs/source/guide_cn/distributed-apis.rst
@@ -8,7 +8,7 @@
This section covers the distributed APIs used in the training script. DGL provides three distributed data structures and various APIs for initialization, distributed sampling and workload split.
For distributed training/inference, DGL provides three distributed data structures: :class:`~dgl.distributed.DistGraph` for distributed graphs,
:class:`~dgl.distributed.DistTensor` for distributed tensors, and, for distributed learnable embeddings,
:class:`~dgl.distributed.nn.NodeEmbedding`.
:class:`~dgl.distributed.DistEmbedding`.

Initialization of the DGL distributed module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -24,7 +24,7 @@ Initialization of the DGL distributed module
dgl.distributed.initialize('ip_config.txt')
th.distributed.init_process_group(backend='gloo')
**Note**: If the training script contains user-defined functions (UDFs) that have to be invoked on the servers (see the DistTensor and NodeEmbedding sections below for details),
**Note**: If the training script contains user-defined functions (UDFs) that have to be invoked on the servers (see the DistTensor and DistEmbedding sections below for details),
these UDFs have to be declared before :func:`~dgl.distributed.initialize`.

Distributed graph
@@ -138,7 +138,7 @@ DGL provides distributed tensors with an interface similar to that of regular single-machine tensors for accessing
Distributed embedding
~~~~~~~~~~~~~~~~~~~~~

DGL provides :class:`~dgl.distributed.nn.NodeEmbedding` to support transductive models that require node embeddings.
DGL provides :class:`~dgl.distributed.DistEmbedding` to support transductive models that require node embeddings.
Creating distributed embeddings is very similar to creating distributed tensors.

.. code:: python
Expand All @@ -147,7 +147,7 @@ DGL提供 :class:`~dgl.distributed.nn.NodeEmbedding` 以支持需要节点嵌入
arr = th.zeros(shape, dtype=dtype)
arr.uniform_(-1, 1)
return arr
emb = dgl.distributed.nn.NodeEmbedding(g.number_of_nodes(), 10, init_func=initializer)
emb = dgl.distributed.DistEmbedding(g.number_of_nodes(), 10, init_func=initializer)
Internally, distributed embeddings are built on top of distributed tensors and thus behave very similarly to distributed tensors.
For example, when embeddings are created, DGL shards them and stores them on all machines in the cluster. Distributed embeddings can be uniquely identified by name.
@@ -169,7 +169,7 @@ DGL provides a sparse Adagrad optimizer, :class:`~dgl.distributed.SparseAdagrad`
optimizer.step()
sparse_optimizer.step()
**Note**: :class:`~dgl.distributed.nn.NodeEmbedding` is not a PyTorch nn module, so users cannot access it from the parameters of an nn module.
**Note**: :class:`~dgl.distributed.DistEmbedding` is not a PyTorch nn module, so users cannot access it from the parameters of an nn module.

Distributed sampling
~~~~~~~~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion docs/source/guide_cn/distributed.rst
@@ -74,7 +74,7 @@ DGL implements a few distributed components to support distributed training; the figure below shows these components
These servers work together to serve the graph data to the trainers. Note that one machine may run multiple server processes simultaneously to parallelize computation and network communication.
* *Sampler processes* interact with the servers and sample nodes and edges to generate mini-batches for training.
* *Trainer processes* contain multiple classes to interact with the servers. They use :class:`~dgl.distributed.DistGraph` to access the partitioned graph data,
:class:`~dgl.distributed.nn.NodeEmbedding` and
:class:`~dgl.distributed.DistEmbedding` and
:class:`~dgl.distributed.DistTensor` to access the node/edge features/embeddings, and
:class:`~dgl.distributed.dist_dataloader.DistDataLoader` to interact with the samplers to get mini-batches.

4 changes: 2 additions & 2 deletions examples/pytorch/graphsage/experimental/README.md
@@ -164,7 +164,7 @@ python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pyt
"python3 train_dist_transductive.py --graph_name ogb-product --ip_config ip_config.txt --batch_size 1000 --num_gpu 4 --eval_every 5"
```

To run supervised training in the transductive setting using DGL distributed NodeEmbedding
To run supervised training in the transductive setting using DGL distributed DistEmbedding
```bash
python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 4 \
@@ -188,7 +188,7 @@ python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pyt
"python3 train_dist_unsupervised_transductive.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --num_gpus 4"
```

To run unsupervised training in the transductive setting using DGL distributed NodeEmbedding
To run unsupervised training in the transductive setting using DGL distributed DistEmbedding
```bash
python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/experimental/ \
--num_trainers 4 \
@@ -13,7 +13,7 @@
import dgl.function as fn
import dgl.nn.pytorch as dglnn
from dgl.distributed import DistDataLoader
from dgl.distributed.nn import NodeEmbedding
from dgl.distributed import DistEmbedding

import torch as th
import torch.nn as nn
@@ -91,7 +91,7 @@ def __init__(self, num_nodes, emb_size, dgl_sparse_emb=False, dev_id='cpu'):
self.emb_size = emb_size
self.dgl_sparse_emb = dgl_sparse_emb
if dgl_sparse_emb:
self.sparse_emb = NodeEmbedding(num_nodes, emb_size, name='sage', init_func=initializer)
self.sparse_emb = DistEmbedding(num_nodes, emb_size, name='sage', init_func=initializer)
else:
self.sparse_emb = th.nn.Embedding(num_nodes, emb_size, sparse=True)
nn.init.uniform_(self.sparse_emb.weight, -1.0, 1.0)
@@ -22,7 +22,6 @@
import torch.multiprocessing as mp
from dgl.distributed import DistDataLoader

from dgl.distributed.optim import SparseAdagrad
from train_dist_unsupervised import SAGE, NeighborSampler, PosNeighborSampler, CrossEntropyLoss, compute_acc
from train_dist_transductive import DistEmb, load_embs

6 changes: 3 additions & 3 deletions examples/pytorch/rgcn/experimental/README.md
@@ -126,7 +126,7 @@ We can get the performance score at the second epoch:
Val Acc 0.4323, Test Acc 0.4255, time: 128.0379
```

The command below launches the same distributed training job using DGL distributed NodeEmbedding
The command below launches the same distributed training job using DGL distributed DistEmbedding
```bash
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/rgcn/experimental/ \
Expand All @@ -135,7 +135,7 @@ python3 ~/workspace/dgl/tools/launch.py \
--num_samplers 4 \
--part_config data/ogbn-mag.json \
--ip_config ip_config.txt \
"python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 1024 --n-hidden 64 --lr 0.01 --eval-batch-size 1024 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --sparse-embedding --sparse-lr 0.06 --num_gpus 1"
"python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 1024 --n-hidden 64 --lr 0.01 --eval-batch-size 1024 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --sparse-embedding --sparse-lr 0.06 --num_gpus 1 --dgl-sparse"
```

We can get the performance score at the second epoch:
@@ -218,5 +218,5 @@ python3 partition_graph.py --dataset ogbn-mag --num_parts 1

### Step 2: run the training script
```bash
python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 128 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --conf-path 'data/ogbn-mag.json' --standalone --sparse-embedding --sparse-lr 0.06 --node-feats
DGL_DIST_MODE=standalone python3 entity_classify_dist.py --graph-name ogbn-mag --dataset ogbn-mag --fanout='25,25' --batch-size 512 --n-hidden 64 --lr 0.01 --eval-batch-size 128 --low-mem --dropout 0.5 --use-self-loop --n-bases 2 --n-epochs 3 --layer-norm --ip-config ip_config.txt --conf-path 'data/ogbn-mag.json' --standalone --sparse-embedding --sparse-lr 0.06
```
9 changes: 6 additions & 3 deletions examples/pytorch/rgcn/experimental/entity_classify_dist.py
@@ -10,7 +10,7 @@
import itertools
import numpy as np
import time
import os
import os, gc
os.environ['DGLBACKEND']='pytorch'

import torch as th
@@ -162,7 +162,7 @@ def __init__(self,
# We only create embeddings for nodes without node features.
if feat_name not in g.nodes[ntype].data:
part_policy = g.get_node_partition_policy(ntype)
self.node_embeds[ntype] = dgl.distributed.nn.NodeEmbedding(g.number_of_nodes(ntype),
self.node_embeds[ntype] = dgl.distributed.DistEmbedding(g.number_of_nodes(ntype),
self.embed_size,
embed_name + '_' + ntype,
init_emb,
@@ -229,6 +229,7 @@ def evaluate(g, model, embed_layer, labels, eval_loader, test_loader, all_val_ni
global_results = dgl.distributed.DistTensor(labels.shape, th.long, 'results', persistent=True)

with th.no_grad():
th.cuda.empty_cache()
for sample_data in tqdm.tqdm(eval_loader):
seeds, blocks = sample_data
for block in blocks:
@@ -245,6 +246,7 @@
test_logits = []
test_seeds = []
with th.no_grad():
th.cuda.empty_cache()
for sample_data in tqdm.tqdm(test_loader):
seeds, blocks = sample_data
for block in blocks:
@@ -347,7 +349,7 @@ def run(args, device, data):
# Create DataLoader for constructing blocks
test_dataloader = DistDataLoader(
dataset=test_nid,
batch_size=args.batch_size,
batch_size=args.eval_batch_size,
collate_fn=test_sampler.sample_blocks,
shuffle=False,
drop_last=False)
@@ -486,6 +488,7 @@ def run(args, device, data):
np.sum(backward_t[-args.log_every:]), np.sum(update_t[-args.log_every:])))
start = time.time()

gc.collect()
print('[{}]Epoch Time(s): {:.4f}, sample: {:.4f}, data copy: {:.4f}, forward: {:.4f}, backward: {:.4f}, update: {:.4f}, #train: {}, #input: {}'.format(
g.rank(), np.sum(step_time), np.sum(sample_t), np.sum(feat_copy_t), np.sum(forward_t), np.sum(backward_t), np.sum(update_t), number_train, number_input))
epoch += 1
3 changes: 1 addition & 2 deletions python/dgl/distributed/__init__.py
@@ -18,8 +18,7 @@
from .dist_tensor import DistTensor
from .partition import partition_graph, load_partition, load_partition_book
from .graph_partition_book import GraphPartitionBook, PartitionPolicy
from .sparse_emb import SparseAdagrad, DistEmbedding
from . import nn
from .nn import *
from . import optim

from .rpc import *
2 changes: 1 addition & 1 deletion python/dgl/distributed/nn/pytorch/__init__.py
@@ -1,2 +1,2 @@
"""dgl distributed sparse optimizer for pytorch."""
from .sparse_emb import NodeEmbedding
from .sparse_emb import DistEmbedding
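
With the wildcard re-export in `python/dgl/distributed/__init__.py` above, the class is now reachable directly from the top-level distributed package. A small illustrative check (assuming the PyTorch backend and a DGL build that includes this commit):

```python
# Both import paths should resolve to the same class after this change.
from dgl.distributed import DistEmbedding
from dgl.distributed.nn.pytorch.sparse_emb import DistEmbedding as _Impl

assert DistEmbedding is _Impl
```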
8 changes: 4 additions & 4 deletions python/dgl/distributed/nn/pytorch/sparse_emb.py
@@ -5,7 +5,7 @@
from .... import utils
from ...dist_tensor import DistTensor

class NodeEmbedding:
class DistEmbedding:
'''Distributed node embeddings.
DGL provides a distributed embedding to support models that require learnable embeddings.
@@ -34,7 +34,7 @@ class NodeEmbedding:
The dimension size of embeddings.
name : str, optional
The name of the embeddings. The name can uniquely identify embeddings in a system
so that another NodeEmbedding object can refer to the same embeddings.
so that another DistEmbedding object can refer to the same embeddings.
init_func : callable, optional
The function to create the initial data. If the init function is not provided,
the values of the embeddings are initialized to zero.
@@ -49,7 +49,7 @@ class NodeEmbedding:
arr = th.zeros(shape, dtype=dtype)
arr.uniform_(-1, 1)
return arr
>>> emb = dgl.distributed.nn.NodeEmbedding(g.number_of_nodes(), 10, init_func=initializer)
>>> emb = dgl.distributed.DistEmbedding(g.number_of_nodes(), 10, init_func=initializer)
>>> optimizer = dgl.distributed.optim.SparseAdagrad([emb], lr=0.001)
>>> for blocks in dataloader:
... feats = emb(nids)
@@ -59,7 +59,7 @@ class NodeEmbedding:
Note
----
When a ``NodeEmbedding`` object is used while the deep learning framework is recording
When a ``DistEmbedding`` object is used while the deep learning framework is recording
the forward computation, users have to invoke
:py:meth:`~dgl.distributed.optim.SparseAdagrad.step` afterwards. Otherwise, there will be
a memory leak.