[Sampling] New sampling pipeline plus asynchronous prefetching (dmlc#3665)

* initial update

* more

* more

* multi-gpu example

* cluster gcn, finalize homogeneous

* more explanation

* fix

* bunch of fixes

* fix

* RGAT example and more fixes

* shadow-gnn sampler and some changes in unit test

* fix

* wth

* more fixes

* remove shadow+node/edge dataloader tests for possible ux changes

* lints

* add legacy dataloading import just in case

* fix

* update pylint for f-strings

* fix

* lint

* lint

* lint again

* cherry-picking commit fa9f494

* oops

* fix

* add sample_neighbors in dist_graph

* fix

* lint

* fix

* fix

* fix

* fix tutorial

* fix

* fix

* fix

* fix warning

* remove debug

* add get_foo_storage apis

* lint
BarclayII authored Jan 30, 2022
1 parent 5152a87 commit 701b4fc
Showing 62 changed files with 4,011 additions and 1,487 deletions.
2 changes: 2 additions & 0 deletions docs/source/api/python/dgl.dataloading.rst
@@ -17,6 +17,8 @@ and an ``EdgeDataLoader`` for edge/link prediction task.
.. autoclass:: NodeDataLoader
.. autoclass:: EdgeDataLoader
.. autoclass:: GraphDataLoader
.. autoclass:: DistNodeDataLoader
.. autoclass:: DistEdgeDataLoader

.. _api-dataloading-neighbor-sampling:

26 changes: 13 additions & 13 deletions docs/source/guide/distributed-apis.rst
@@ -202,20 +202,20 @@ DGL provides two levels of APIs for sampling nodes and edges to generate mini-ba
(see the section of mini-batch training). The low-level APIs require users to write code
to explicitly define how a layer of nodes are sampled (e.g., using :func:`dgl.sampling.sample_neighbors` ).
The high-level sampling APIs implement a few popular sampling algorithms for node classification
and link prediction tasks (e.g., :class:`~dgl.dataloading.pytorch.NodeDataloader` and
:class:`~dgl.dataloading.pytorch.EdgeDataloader` ).
and link prediction tasks (e.g., :class:`~dgl.dataloading.pytorch.NodeDataLoader` and
:class:`~dgl.dataloading.pytorch.EdgeDataLoader` ).

The distributed sampling module follows the same design and provides two levels of sampling APIs.
For the lower-level sampling API, it provides :func:`~dgl.distributed.sample_neighbors` for
distributed neighborhood sampling on :class:`~dgl.distributed.DistGraph`. In addition, DGL provides
a distributed Dataloader (:class:`~dgl.distributed.DistDataLoader` ) for distributed sampling.
The distributed Dataloader has the same interface as Pytorch DataLoader except that users cannot
a distributed DataLoader (:class:`~dgl.distributed.DistDataLoader` ) for distributed sampling.
The distributed DataLoader has the same interface as Pytorch DataLoader except that users cannot
specify the number of worker processes when creating a dataloader. The worker processes are created
in :func:`dgl.distributed.initialize`.

**Note**: When running :func:`dgl.distributed.sample_neighbors` on :class:`~dgl.distributed.DistGraph`,
the sampler cannot run in Pytorch Dataloader with multiple worker processes. The main reason is that
Pytorch Dataloader creates new sampling worker processes in every epoch, which leads to creating and
the sampler cannot run in Pytorch DataLoader with multiple worker processes. The main reason is that
Pytorch DataLoader creates new sampling worker processes in every epoch, which leads to creating and
destroying :class:`~dgl.distributed.DistGraph` objects many times.
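
For orientation, here is a minimal sketch of the setup this note assumes — ``dgl.distributed.initialize`` spawns the sampler worker processes before the ``DistGraph`` and any dataloaders are created. The ``ip_config.txt`` path and graph name below are illustrative, not part of this commit.

.. code:: python

    import dgl
    import torch.distributed as dist

    dgl.distributed.initialize('ip_config.txt')     # spawns the sampler worker processes
    dist.init_process_group(backend='gloo')         # trainer process group
    g = dgl.distributed.DistGraph('ogbn-products')  # graph name from the partition config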

When using the low-level API, the sampling code is similar to single-process sampling. The only
@@ -240,17 +240,17 @@ difference is that users need to use :func:`dgl.distributed.sample_neighbors` an
    for batch in dataloader:
        ...
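
The collapsed hunk above hides the body of the low-level example. As a rough sketch (not part of this diff), the sampling function handed to :class:`~dgl.distributed.DistDataLoader` typically looks like the following, assuming ``torch`` and ``dgl`` are imported, ``g`` is the :class:`~dgl.distributed.DistGraph`, and fanouts of ``[10, 25]``; the helper name is illustrative.

.. code:: python

    def sample_blocks(seeds):
        seeds = torch.LongTensor(seeds)
        blocks = []
        for fanout in [10, 25]:
            # Sample a fixed number of in-neighbors for each seed node on the DistGraph.
            frontier = dgl.distributed.sample_neighbors(g, seeds, fanout)
            # Convert the sampled frontier into a block (message flow graph).
            block = dgl.to_block(frontier, seeds)
            seeds = block.srcdata[dgl.NID]
            blocks.insert(0, block)
        return blocks
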
The same high-level sampling APIs (:class:`~dgl.dataloading.pytorch.NodeDataloader` and
:class:`~dgl.dataloading.pytorch.EdgeDataloader` ) work for both :class:`~dgl.DGLGraph`
and :class:`~dgl.distributed.DistGraph`. When using :class:`~dgl.dataloading.pytorch.NodeDataloader`
and :class:`~dgl.dataloading.pytorch.EdgeDataloader`, the distributed sampling code is exactly
the same as single-process sampling.
The high-level sampling APIs (:class:`~dgl.dataloading.pytorch.NodeDataLoader` and
:class:`~dgl.dataloading.pytorch.EdgeDataLoader`) have distributed counterparts
(:class:`~dgl.dataloading.pytorch.DistNodeDataLoader` and
:class:`~dgl.dataloading.pytorch.DistEdgeDataLoader`). Otherwise the code is exactly the
same as for single-process sampling.

.. code:: python

    sampler = dgl.sampling.MultiLayerNeighborSampler([10, 25])
    dataloader = dgl.sampling.NodeDataLoader(g, train_nid, sampler,
                                             batch_size=batch_size, shuffle=True)
    dataloader = dgl.sampling.DistNodeDataLoader(g, train_nid, sampler,
                                                 batch_size=batch_size, shuffle=True)
    for batch in dataloader:
        ...
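
As a usage note (not part of this diff): in the released API the loader classes live under ``dgl.dataloading``, and a node dataloader yields ``(input_nodes, output_nodes, blocks)`` triples. A hedged end-to-end sketch, assuming ``g`` and ``train_nid`` as above:

.. code:: python

    sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
    dataloader = dgl.dataloading.DistNodeDataLoader(
        g, train_nid, sampler, batch_size=1024, shuffle=True, drop_last=False)
    for input_nodes, output_nodes, blocks in dataloader:
        ...  # blocks are the per-layer message flow graphs
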
20 changes: 10 additions & 10 deletions docs/source/guide_cn/distributed-apis.rst
@@ -177,9 +177,9 @@ DGL provides a sparse Adagrad optimizer :class:`~dgl.distributed.SparseAdagr
DGL provides two levels of APIs for sampling nodes and edges to generate mini-batch training data (see the chapter on mini-batch training).
The low-level APIs require users to write code that explicitly defines how a layer of nodes is sampled (e.g., using :func:`dgl.sampling.sample_neighbors`).
The high-level sampling APIs implement several popular sampling algorithms for node classification and link prediction tasks (e.g.,
:class:`~dgl.dataloading.pytorch.NodeDataloader` and
:class:`~dgl.dataloading.pytorch.EdgeDataloader` ).
:class:`~dgl.dataloading.pytorch.NodeDataLoader` and
:class:`~dgl.dataloading.pytorch.EdgeDataLoader` ).

The distributed sampling module follows the same design and also provides two levels of sampling APIs. For the low-level sampling API, it provides
distributed neighbor sampling on :class:`~dgl.distributed.DistGraph` via
@@ -188,7 +188,7 @@ DGL provides two levels of APIs for sampling nodes and edges to generate mini-
The distributed data loader has the same interface as the PyTorch DataLoader. Its worker processes are created in :func:`dgl.distributed.initialize`.

**Note**: When running :func:`dgl.distributed.sample_neighbors` on :class:`~dgl.distributed.DistGraph`,
the sampler cannot run in a PyTorch Dataloader with multiple worker processes. The main reason is that the PyTorch Dataloader creates new sampling worker processes in every epoch,
the sampler cannot run in a PyTorch DataLoader with multiple worker processes. The main reason is that the PyTorch DataLoader creates new sampling worker processes in every epoch,
which leads to the :class:`~dgl.distributed.DistGraph` objects being created and destroyed many times.

When using the low-level API, the sampling code is similar to single-process sampling. The only difference is that users need to use
@@ -214,19 +214,19 @@ DGL provides two levels of APIs for sampling nodes and edges to generate mini-
    for batch in dataloader:
        ...
Both :class:`~dgl.DGLGraph` and :class:`~dgl.distributed.DistGraph` can use the same high-level sampling APIs
(:class:`~dgl.dataloading.pytorch.NodeDataloader` and
:class:`~dgl.dataloading.pytorch.EdgeDataloader`). When using
:class:`~dgl.dataloading.pytorch.NodeDataloader` and
:class:`~dgl.dataloading.pytorch.EdgeDataloader`, the distributed sampling code is exactly the same as single-process sampling.
:class:`~dgl.dataloading.pytorch.NodeDataLoader` and
:class:`~dgl.dataloading.pytorch.EdgeDataLoader` have distributed versions,
:class:`~dgl.dataloading.pytorch.DistNodeDataLoader` and
:class:`~dgl.dataloading.pytorch.DistEdgeDataLoader`. When they are used,
the distributed sampling code is almost exactly the same as single-process sampling.

.. code:: python

    sampler = dgl.sampling.MultiLayerNeighborSampler([10, 25])
    dataloader = dgl.sampling.NodeDataLoader(g, train_nid, sampler,
                                             batch_size=batch_size, shuffle=True)
    dataloader = dgl.sampling.DistNodeDataLoader(g, train_nid, sampler,
                                                 batch_size=batch_size, shuffle=True)
    for batch in dataloader:
        ...
10 changes: 5 additions & 5 deletions docs/source/guide_ko/distributed-apis.rst
@@ -132,8 +132,8 @@ DGL ... transductive models that require node embeddings (transductive models
Distributed Sampling
~~~~~~~~~~~~~~~~~~~~

DGL provides two levels of APIs for sampling nodes and edges to generate mini-batches (see the mini-batch training section). The low-level API requires users to write code that explicitly defines how a layer of nodes is sampled (for example, using :func:`dgl.sampling.sample_neighbors`). The high-level API implements several popular sampling algorithms used for node classification and link prediction (e.g., :class:`~dgl.dataloading.pytorch.NodeDataloader` and
:class:`~dgl.dataloading.pytorch.EdgeDataloader`).
DGL provides two levels of APIs for sampling nodes and edges to generate mini-batches (see the mini-batch training section). The low-level API requires users to write code that explicitly defines how a layer of nodes is sampled (for example, using :func:`dgl.sampling.sample_neighbors`). The high-level API implements several popular sampling algorithms used for node classification and link prediction (e.g., :class:`~dgl.dataloading.pytorch.NodeDataLoader` and
:class:`~dgl.dataloading.pytorch.EdgeDataLoader`).

The distributed sampling module follows the same design and provides two levels of sampling APIs. For the low-level sampling API, :func:`~dgl.distributed.sample_neighbors` performs distributed neighbor sampling on a :class:`~dgl.distributed.DistGraph`. In addition, DGL provides a distributed data loader, :class:`~dgl.distributed.DistDataLoader`, for distributed sampling. The distributed DataLoader has the same interface as the PyTorch DataLoader, except that users cannot specify the number of worker processes when creating the data loader. The worker processes are created in :func:`dgl.distributed.initialize`.

@@ -159,13 +159,13 @@ When using the low-level API, the sampling code is similar to single-process samp
    for batch in dataloader:
        ...
The same high-level sampling APIs (:class:`~dgl.dataloading.pytorch.NodeDataloader` and :class:`~dgl.dataloading.pytorch.EdgeDataloader`) work for both :class:`~dgl.DGLGraph` and :class:`~dgl.distributed.DistGraph`. When using :class:`~dgl.dataloading.pytorch.NodeDataloader` and :class:`~dgl.dataloading.pytorch.EdgeDataloader`, the distributed sampling code is exactly the same as the single-process sampling code.
The same high-level sampling APIs (:class:`~dgl.dataloading.pytorch.NodeDataLoader` and :class:`~dgl.dataloading.pytorch.EdgeDataLoader`) work for both :class:`~dgl.DGLGraph` and :class:`~dgl.distributed.DistGraph`. When using :class:`~dgl.dataloading.pytorch.NodeDataLoader` and :class:`~dgl.dataloading.pytorch.EdgeDataLoader`, the distributed sampling code is exactly the same as the single-process sampling code.

.. code:: python

    sampler = dgl.sampling.MultiLayerNeighborSampler([10, 25])
    dataloader = dgl.sampling.NodeDataLoader(g, train_nid, sampler,
                                             batch_size=batch_size, shuffle=True)
    dataloader = dgl.sampling.DistNodeDataLoader(g, train_nid, sampler,
                                                 batch_size=batch_size, shuffle=True)
    for batch in dataloader:
        ...
88 changes: 88 additions & 0 deletions examples/pytorch/__temporary__/cluster_gcn/cluster_gcn.py
@@ -0,0 +1,88 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchmetrics.functional as MF
import dgl
import dgl.nn as dglnn
import time
import numpy as np
from ogb.nodeproppred import DglNodePropPredDataset

USE_WRAPPER = True

class SAGE(nn.Module):
    def __init__(self, in_feats, n_hidden, n_classes):
        super().__init__()
        self.layers = nn.ModuleList()
        self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, 'mean'))
        self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, 'mean'))
        self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, 'mean'))
        self.dropout = nn.Dropout(0.5)

    def forward(self, sg, x):
        h = x
        for l, layer in enumerate(self.layers):
            h = layer(sg, h)
            if l != len(self.layers) - 1:
                h = F.relu(h)
                h = self.dropout(h)
        return h

dataset = DglNodePropPredDataset('ogbn-products')
graph, labels = dataset[0]
graph.ndata['label'] = labels
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
graph.ndata['train_mask'] = torch.zeros(graph.num_nodes(), dtype=torch.bool).index_fill_(0, train_idx, True)
graph.ndata['valid_mask'] = torch.zeros(graph.num_nodes(), dtype=torch.bool).index_fill_(0, valid_idx, True)
graph.ndata['test_mask'] = torch.zeros(graph.num_nodes(), dtype=torch.bool).index_fill_(0, test_idx, True)

model = SAGE(graph.ndata['feat'].shape[1], 256, dataset.num_classes).cuda()
opt = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=5e-4)

if USE_WRAPPER:
    import dglnew
    graph.create_formats_()
    graph = dglnew.graph.wrapper.DGLGraphStorage(graph)

num_partitions = 1000
sampler = dgl.dataloading.ClusterGCNSampler(
    graph, num_partitions,
    prefetch_node_feats=['feat', 'label', 'train_mask', 'valid_mask', 'test_mask'])
# DataLoader for generic dataloading with a graph, a set of indices (any indices, like
# partition IDs here), and a graph sampler.
# NodeDataLoader and EdgeDataLoader are simply special cases of DataLoader where the
# indices are guaranteed to be node and edge IDs.
dataloader = dgl.dataloading.DataLoader(
    graph,
    torch.arange(num_partitions),
    sampler,
    device='cuda',
    batch_size=100,
    shuffle=True,
    drop_last=False,
    pin_memory=True,
    num_workers=8,
    persistent_workers=True,
    use_prefetch_thread=True)  # TBD: could probably remove this argument

durations = []
for _ in range(10):
t0 = time.time()
for it, sg in enumerate(dataloader):
x = sg.ndata['feat']
y = sg.ndata['label'][:, 0]
m = sg.ndata['train_mask']
y_hat = model(sg, x)
loss = F.cross_entropy(y_hat[m], y[m])
opt.zero_grad()
loss.backward()
opt.step()
if it % 20 == 0:
acc = MF.accuracy(y_hat[m], y[m])
mem = torch.cuda.max_memory_allocated() / 1000000
print('Loss', loss.item(), 'Acc', acc.item(), 'GPU Mem', mem, 'MB')
tt = time.time()
print(tt - t0)
durations.append(tt - t0)
print(np.mean(durations[4:]), np.std(durations[4:]))
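
The file above only runs training. A minimal validation pass over the same partition dataloader could look like the sketch below; it is not part of the commit and simply reuses the ``valid_mask`` feature that the sampler already prefetches.

.. code:: python

    # Hedged sketch: average accuracy over partitions that contain validation nodes.
    model.eval()
    with torch.no_grad():
        accs = []
        for sg in dataloader:
            m = sg.ndata['valid_mask']
            if m.any():
                y_hat = model(sg, sg.ndata['feat'])
                accs.append(MF.accuracy(y_hat[m], sg.ndata['label'][:, 0][m]))
        print('Validation acc (partition average):', torch.stack(accs).mean().item())
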
2 changes: 2 additions & 0 deletions examples/pytorch/__temporary__/dglnew/__init__.py
@@ -0,0 +1,2 @@
from . import graph
from . import storages
3 changes: 3 additions & 0 deletions examples/pytorch/__temporary__/dglnew/graph/__init__.py
@@ -0,0 +1,3 @@
from .graph import *
from .other_feature import *
from .wrapper import *
150 changes: 150 additions & 0 deletions examples/pytorch/__temporary__/dglnew/graph/graph.py
@@ -0,0 +1,150 @@
class GraphStorage(object):
    def get_node_storage(self, key, ntype=None):
        pass

    def get_edge_storage(self, key, etype=None):
        pass

    # Required for checking whether a single dict is allowed for ndata and edata.
    @property
    def ntypes(self):
        pass

    @property
    def canonical_etypes(self):
        pass

    def etypes(self):
        return [etype[1] for etype in self.canonical_etypes]

    def sample_neighbors(self, seed_nodes, fanout, edge_dir='in', prob=None,
                         exclude_edges=None, replace=False, output_device=None):
        """Return a DGLGraph which is a subgraph induced by sampling neighboring edges of
        the given nodes.

        See ``dgl.sampling.sample_neighbors`` for detailed semantics.

        Parameters
        ----------
        seed_nodes : Tensor or dict[str, Tensor]
            Node IDs to sample neighbors from.
            This argument can take a single ID tensor or a dictionary of node types and ID tensors.
            If a single tensor is given, the graph must only have one type of nodes.
        fanout : int or dict[etype, int]
            The number of edges to be sampled for each node on each edge type.
            This argument can take a single int or a dictionary of edge types and ints.
            If a single int is given, DGL will sample this number of edges for each node for
            every edge type.
            If -1 is given for a single edge type, all the neighboring edges with that edge
            type will be selected.
        prob : str, optional
            Feature name used as the (unnormalized) probabilities associated with each
            neighboring edge of a node. The feature must have only one element for each
            edge.
            The features must be non-negative floats, and the sum of the features of
            inbound/outbound edges for every node must be positive (though they don't have
            to sum up to one). Otherwise, the result will be undefined.
            If :attr:`prob` is not None, GPU sampling is not supported.
        exclude_edges : tensor or dict
            Edge IDs to exclude during sampling neighbors for the seed nodes.
            This argument can take a single ID tensor or a dictionary of edge types and ID tensors.
            If a single tensor is given, the graph must only have one type of nodes.
        replace : bool, optional
            If True, sample with replacement.
        output_device : Framework-specific device context object, optional
            The output device. Default is the same as the input graph.

        Returns
        -------
        DGLGraph
            A sampled subgraph with the same nodes as the original graph, but only the sampled
            neighboring edges. The induced edge IDs will be in ``edata[dgl.EID]``.
        """
        pass

    # Required in Cluster-GCN
    def subgraph(self, nodes, relabel_nodes=False, output_device=None):
        """Return a subgraph induced on given nodes.

        This has the same semantics as ``dgl.node_subgraph``.

        Parameters
        ----------
        nodes : nodes or dict[str, nodes]
            The nodes to form the subgraph. The allowed nodes formats are:

            * Int Tensor: Each element is a node ID. The tensor must have the same device type
              and ID data type as the graph's.
            * iterable[int]: Each element is a node ID.
            * Bool Tensor: Each :math:`i^{th}` element is a bool flag indicating whether
              node :math:`i` is in the subgraph.

            If the graph is homogeneous, one can directly pass the above formats.
            Otherwise, the argument must be a dictionary with keys being node types
            and values being the node IDs in the above formats.
        relabel_nodes : bool, optional
            If True, the extracted subgraph will only have the nodes in the specified node set
            and it will relabel the nodes in order.
        output_device : Framework-specific device context object, optional
            The output device. Default is the same as the input graph.

        Returns
        -------
        DGLGraph
            The subgraph.
        """
        pass

    # Required in Link Prediction
    def edge_subgraph(self, edges, relabel_nodes=False, output_device=None):
        """Return a subgraph induced on given edges.

        This has the same semantics as ``dgl.edge_subgraph``.

        Parameters
        ----------
        edges : edges or dict[(str, str, str), edges]
            The edges to form the subgraph. The allowed edges formats are:

            * Int Tensor: Each element is an edge ID. The tensor must have the same device type
              and ID data type as the graph's.
            * iterable[int]: Each element is an edge ID.
            * Bool Tensor: Each :math:`i^{th}` element is a bool flag indicating whether
              edge :math:`i` is in the subgraph.

            If the graph is homogeneous, one can directly pass the above formats.
            Otherwise, the argument must be a dictionary with keys being edge types
            and values being the edge IDs in the above formats.
        relabel_nodes : bool, optional
            If True, the extracted subgraph will only have the nodes in the specified node set
            and it will relabel the nodes in order.
        output_device : Framework-specific device context object, optional
            The output device. Default is the same as the input graph.

        Returns
        -------
        DGLGraph
            The subgraph.
        """
        pass

    # Required in Link Prediction negative sampler
    def find_edges(self, edges, etype=None, output_device=None):
        """Return the source and destination node IDs given the edge IDs within the given edge type.
        """
        pass

    # Required in Link Prediction negative sampler
    def num_nodes(self, ntype):
        """Return the number of nodes for the given node type."""
        pass

    def global_uniform_negative_sampling(self, num_samples, exclude_self_loops=True,
                                         replace=False, etype=None):
        """Per source negative sampling as in ``dgl.dataloading.GlobalUniform``"""
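
To make the interface above concrete, here is a hedged sketch of a storage that simply delegates to an in-memory ``DGLGraph``, in the spirit of the ``dglnew.graph.wrapper.DGLGraphStorage`` used by the examples. The class name and delegation details below are illustrative, not taken from this commit.

.. code:: python

    import dgl

    class InMemoryGraphStorage(object):
        # Illustrative GraphStorage backed by a regular in-memory DGLGraph.
        def __init__(self, g):
            self.g = g

        @property
        def ntypes(self):
            return self.g.ntypes

        @property
        def canonical_etypes(self):
            return self.g.canonical_etypes

        def sample_neighbors(self, seed_nodes, fanout, edge_dir='in', prob=None,
                             exclude_edges=None, replace=False, output_device=None):
            # Delegate to the built-in neighbor sampler.
            return dgl.sampling.sample_neighbors(
                self.g, seed_nodes, fanout, edge_dir=edge_dir, prob=prob,
                exclude_edges=exclude_edges, replace=replace)

        def subgraph(self, nodes, relabel_nodes=False, output_device=None):
            # Note: dgl.node_subgraph always relabels; a full implementation
            # would honor the relabel_nodes flag.
            return dgl.node_subgraph(self.g, nodes)
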
