[Feature] add NodeFlow API (dmlc#361)
* sample layer subgraphs.

* fix.

* fix.

* add layered subgraph.

* fix lint.

* fix.

* fix tutorial.

* fix.

* remove copy_to_parent.

* add num_layers

* move sampling code to sampler.cc

* fix.

* move subgraph construction out.

* Revert "move subgraph construction out."

This reverts commit 24b3d13.

* change to NodeFlow.

* use NodeFlow in Python.

* use NodeFlowIndex.

* add node_mapping and edge_mapping.

* remove unnecessary code in SSE tutorial.

* Revert "remove unnecessary code in SSE tutorial."

This reverts commit 093f041.

* fix tutorial.

* move to node_flow.

* update gcn cv updater.

* import NodeFlow.

* update.

* add demo code for vanilla control variate sampler.

* update.

* update.

* add neighbor sampling.

* return flow offsets.

* update node_flow.

* add test.

* fix sampler.

* fix graph index.

* fix a bug in sampler.

* fix map_to_layer_nid and map_to_flow_eid.

* fix apply_flow.

* remove model code.

* implement flow_compute.

* fix a bug.

* reverse the csr physically.

* add mini-batch test.

* add mini batch test.

* update flow_compute.

* add prop_flows

* run on specific nodes.

* test copy

* fix a bug in creating frame in NodeFlow.

* add init gcn_cv_updater.

* fix a minor bug.

* fix gcn_cv_updater.

* fix a bug.

* fix a bug in NodeFlow.

* use new h in gcn_cv_updater.

* add layer_in_degree and layer_out_degree.

* fix gcn_cv_updater for gpu.

* temp fix in NodeFlow for diff context.

* allow enabling/disabling copy back.

* add with-updater option.

* fix a bug in computing degree.

* add with-cv option.

* rename and add comments.

* fix lint complain.

* fix lint.

* avoid assert.

* remove assert.

* fix.

* fix.

* fix.

* fix.

* fix the methods in NodeFlow.

* fix lint.

* update SSE.

* remove gcn_cv_updater.

* correct comments for the schedulers.

* update comment.

* add map_to_nodeflow_nid

* address comment.

* remove duplicated test.

* fix int.

* fix comments.

* fix lint

* fix.

* replace subgraph with NodeFlow.

* move view.

* address comments.

* fix lint.

* fix lint.

* remove static_cast.

* fix docstring.

* fix comments.

* break SampleSubgraph.

* move neighbor sampling to sampler.cc

* fix comments.

* rename.

* split neighbor_list.

* address comments.

* fix.

* remove TODO.
zheng-da authored Feb 19, 2019
1 parent 220a1e6 commit f370e62
Showing 20 changed files with 2,812 additions and 1,376 deletions.
4 changes: 2 additions & 2 deletions examples/mxnet/sse/sse_batch.py
@@ -268,7 +268,7 @@ def main(args, data):
dur = []
sampler = dgl.contrib.sampling.NeighborSampler(g, args.batch_size, neigh_expand,
neighbor_type='in', num_workers=args.num_parallel_subgraphs, seed_nodes=train_vs,
shuffle=True, return_seed_id=True)
shuffle=True)
if args.cache_subgraph:
sampler = CachedSubgraphLoader(sampler, shuffle=True)
for epoch in range(args.n_epochs):
@@ -279,7 +279,7 @@ def main(args, data):
start1 = time.time()
for subg, aux_infos in sampler:
seeds = aux_infos['seeds']
subg_seeds = subg.map_to_subgraph_nid(seeds)
subg_seeds = subg.layer_nid(0)
subg.copy_from_parent()

losses = []
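With this change the SSE example no longer maps parent-graph seed ids into the sampled graph by hand: in the NodeFlow returned by the sampler, the seed nodes form a layer of their own (layer 0 at this point in the API), so layer_nid(0) yields them directly. A minimal sketch of the resulting loop, assuming an existing read-only DGLGraph g with node data attached; the model call is a placeholder:

import dgl.contrib.sampling as sampling

# `g` is assumed to be a read-only DGLGraph that already carries node data.
sampler = sampling.NeighborSampler(g, batch_size=32, expand_factor=16,
                                   neighbor_type='in', num_workers=1,
                                   shuffle=True)
for nf, aux_infos in sampler:        # each item is (NodeFlow, dict of extra info)
    seeds = nf.layer_nid(0)          # seed nodes live in layer 0 of the NodeFlow
    nf.copy_from_parent()            # pull node/edge data in from the parent graph
    # train_step(nf, seeds)          # placeholder for the actual model computation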
12 changes: 1 addition & 11 deletions include/dgl/graph.h
@@ -6,6 +6,7 @@
#ifndef DGL_GRAPH_H_
#define DGL_GRAPH_H_

#include <string>
#include <vector>
#include <string>
#include <cstdint>
@@ -369,17 +370,6 @@ class Graph: public GraphInterface {
*/
virtual std::vector<IdArray> GetAdj(bool transpose, const std::string &fmt) const;

/*!
* \brief Sample a subgraph from the seed vertices with neighbor sampling.
* The neighbors are sampled with a uniform distribution.
* \return a subgraph
*/
virtual SampledSubgraph NeighborUniformSample(IdArray seeds, const std::string &neigh_type,
int num_hops, int expand_factor) const {
LOG(FATAL) << "NeighborUniformSample isn't supported in mutable graph";
return SampledSubgraph();
}

protected:
friend class GraphOp;
/*! \brief Internal edge list type */
25 changes: 1 addition & 24 deletions include/dgl/graph_interface.h
@@ -20,7 +20,7 @@ typedef dgl::runtime::NDArray BoolArray;
typedef dgl::runtime::NDArray IntArray;

struct Subgraph;
struct SampledSubgraph;
struct NodeFlow;

/*!
* \brief This class references data in std::vector.
@@ -332,14 +332,6 @@
* \return a vector of IdArrays.
*/
virtual std::vector<IdArray> GetAdj(bool transpose, const std::string &fmt) const = 0;

/*!
* \brief Sample a subgraph from the seed vertices with neighbor sampling.
* The neighbors are sampled with a uniform distribution.
* \return a subgraph
*/
virtual SampledSubgraph NeighborUniformSample(IdArray seeds, const std::string &neigh_type,
int num_hops, int expand_factor) const = 0;
};

/*! \brief Subgraph data structure */
@@ -358,21 +350,6 @@ struct Subgraph {
IdArray induced_edges;
};

/*!
* \brief When we sample a subgraph, we need to store extra information,
* such as the layer Ids of the vertices and the sampling probability.
*/
struct SampledSubgraph: public Subgraph {
/*!
* \brief the layer of a sampled vertex in the subgraph.
*/
IdArray layer_ids;
/*!
* \brief the probability that a vertex is sampled.
*/
runtime::NDArray sample_prob;
};

} // namespace dgl

#endif // DGL_GRAPH_INTERFACE_H_
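The Subgraph structure kept here records, for every node and edge of a subgraph, the id of the corresponding node or edge in the parent graph (induced_nodes and induced_edges). A small NumPy sketch, on made-up data, of how such an induced mapping is typically used to gather parent features into the subgraph and scatter results back:

import numpy as np

parent_feat = np.random.rand(10, 4)     # features of 10 parent-graph nodes
induced_nodes = np.array([2, 5, 7])     # subgraph node i corresponds to parent node induced_nodes[i]

sub_feat = parent_feat[induced_nodes]   # gather: the "copy from parent" direction
sub_feat = sub_feat * 2.0               # placeholder computation on the subgraph
parent_feat[induced_nodes] = sub_feat   # scatter: write subgraph results back to the parent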
25 changes: 9 additions & 16 deletions include/dgl/immutable_graph.h
@@ -56,6 +56,11 @@ class ImmutableGraph: public GraphInterface {
return indices.size();
}

/* This gets the sum of vertex degrees in the range. */
uint64_t GetDegree(dgl_id_t start, dgl_id_t end) const {
return indptr[end] - indptr[start];
}

uint64_t GetDegree(dgl_id_t vid) const {
return indptr[vid + 1] - indptr[vid];
}
@@ -456,14 +461,6 @@ class ImmutableGraph: public GraphInterface {
return gptr;
}

/*!
* \brief Sample a subgraph from the seed vertices with neighbor sampling.
* The neighbors are sampled with a uniform distribution.
* \return a subgraph
*/
SampledSubgraph NeighborUniformSample(IdArray seeds, const std::string &neigh_type,
int num_hops, int expand_factor) const;

/*!
* \brief Get the adjacency matrix of the graph.
*
@@ -475,10 +472,6 @@
*/
virtual std::vector<IdArray> GetAdj(bool transpose, const std::string &fmt) const;

protected:
DGLIdIters GetInEdgeIdRef(dgl_id_t src, dgl_id_t dst) const;
DGLIdIters GetOutEdgeIdRef(dgl_id_t src, dgl_id_t dst) const;

/*
* The immutable graph may only contain one of the CSRs (e.g., the sampled subgraphs).
* When we get in csr or out csr, we try to get the one cached in the structure.
@@ -503,6 +496,10 @@ class ImmutableGraph: public GraphInterface {
}
}

protected:
DGLIdIters GetInEdgeIdRef(dgl_id_t src, dgl_id_t dst) const;
DGLIdIters GetOutEdgeIdRef(dgl_id_t src, dgl_id_t dst) const;

/*!
* \brief Get the CSR array that represents the in-edges.
* This method copies data from std::vector to IdArray.
@@ -517,10 +514,6 @@ class ImmutableGraph: public GraphInterface {
*/
CSRArray GetOutCSRArray() const;

SampledSubgraph SampleSubgraph(IdArray seed_arr, const float* probability,
const std::string &neigh_type,
int num_hops, size_t num_neighbor) const;

/*!
* \brief Compact a subgraph.
* In a sampled subgraph, the vertex Id is still in the ones in the original graph.
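The GetDegree(start, end) overload added above relies on the CSR layout of ImmutableGraph: indptr[v] is the position of vertex v's first edge, so the number of edges owned by the contiguous vertex range [start, end) is simply indptr[end] - indptr[start]. A small NumPy sketch of that arithmetic on toy data:

import numpy as np

# CSR row pointer of a toy graph: vertex v owns edges indptr[v] .. indptr[v+1]-1.
indptr = np.array([0, 2, 5, 5, 9])      # per-vertex degrees: 2, 3, 0, 4

def degree(v):
    return indptr[v + 1] - indptr[v]

def degree_range(start, end):
    # Sum of degrees over vertices in [start, end), as in ImmutableGraph::GetDegree(start, end).
    return indptr[end] - indptr[start]

assert degree_range(1, 4) == degree(1) + degree(2) + degree(3)   # 3 + 0 + 4 == 7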
64 changes: 64 additions & 0 deletions include/dgl/sampler.h
@@ -0,0 +1,64 @@
/*!
* Copyright (c) 2018 by Contributors
* \file dgl/sampler.h
* \brief DGL sampler header.
*/
#ifndef DGL_SAMPLER_H_
#define DGL_SAMPLER_H_

#include "graph_interface.h"

namespace dgl {

class ImmutableGraph;

/*!
* \brief A NodeFlow graph stores the sampling results for a sampler that samples
* nodes/edges in layers.
*
* We store multiple layers of the sampling results in a single graph, which results
* in a more compact format. We store extra information,
* such as the node and edge mapping from the NodeFlow graph to the parent graph.
*/
struct NodeFlow {
/*! \brief The graph. */
GraphPtr graph;
/*!
* \brief the offsets of each layer.
*/
IdArray layer_offsets;
/*!
* \brief the offsets of each flow.
*/
IdArray flow_offsets;
/*!
* \brief The node mapping from the NodeFlow graph to the parent graph.
*/
IdArray node_mapping;
/*!
* \brief The edge mapping from the NodeFlow graph to the parent graph.
*/
IdArray edge_mapping;
};

class SamplerOp {
public:
/*!
* \brief Sample a graph from the seed vertices with neighbor sampling.
* The neighbors are sampled with a uniform distribution.
*
* \param graphs A graph for sampling.
* \param seeds the nodes where we should start to sample.
* \param edge_type the type of edges we should sample neighbors.
* \param num_hops the number of hops to sample neighbors.
* \param expand_factor the max number of neighbors to sample.
* \return a NodeFlow graph.
*/
static NodeFlow NeighborUniformSample(const ImmutableGraph *graph, IdArray seeds,
const std::string &edge_type,
int num_hops, int expand_factor);
};

} // namespace dgl

#endif // DGL_SAMPLER_H_
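In the NodeFlow structure defined above, layer_offsets partitions the NodeFlow's node ids into consecutive layers and flow_offsets does the same for the edges of each layer-to-layer flow, while node_mapping and edge_mapping translate NodeFlow ids back to parent-graph ids. A sketch of that indexing convention on invented data (the concrete layer ordering and offsets here are assumptions for illustration, not something the header fixes):

import numpy as np

# Hypothetical 3-layer NodeFlow: 4 nodes in layer 0, 6 in layer 1, 10 in layer 2.
layer_offsets = np.array([0, 4, 10, 20])
node_mapping = np.array([17, 3, 42, 8, 11, 29, 5, 0, 31, 26,
                         14, 9, 22, 7, 19, 2, 35, 12, 40, 6])   # made-up parent node ids

def layer_nid(layer):
    # NodeFlow-local node ids belonging to one layer.
    return np.arange(layer_offsets[layer], layer_offsets[layer + 1])

def layer_parent_nid(layer):
    # The same nodes, translated to parent-graph ids via node_mapping.
    return node_mapping[layer_nid(layer)]

print(layer_nid(0))          # [0 1 2 3]
print(layer_parent_nid(0))   # [17  3 42  8]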
59 changes: 29 additions & 30 deletions python/dgl/contrib/sampling/sampler.py
@@ -1,4 +1,4 @@
# This file contains subgraph samplers.
# This file contains NodeFlow samplers.

import sys
import numpy as np
@@ -7,7 +7,7 @@
import traceback

from ... import utils
from ...subgraph import DGLSubGraph
from ...node_flow import NodeFlow
from ... import backend as F
try:
import Queue as queue
@@ -22,7 +22,7 @@ def __init__(self, g, batch_size, expand_factor, num_hops=1,
shuffle=False, num_workers=1, return_seed_id=False):
self._g = g
if not g._graph.is_readonly():
raise NotImplementedError("subgraph loader only support read-only graphs.")
raise NotImplementedError("NodeFlow loader only support read-only graphs.")
self._batch_size = batch_size
self._expand_factor = expand_factor
self._num_hops = num_hops
@@ -39,45 +39,44 @@ def __init__(self, g, batch_size, expand_factor, num_hops=1,
self._seed_nodes = F.rand_shuffle(self._seed_nodes)
self._num_workers = num_workers
self._neighbor_type = neighbor_type
self._subgraphs = []
self._nflows = []
self._seed_ids = []
self._subgraph_idx = 0
self._nflow_idx = 0

def _prefetch(self):
seed_ids = []
num_nodes = len(self._seed_nodes)
for i in range(self._num_workers):
start = self._subgraph_idx * self._batch_size
start = self._nflow_idx * self._batch_size
# if we have visited all nodes, don't do anything.
if start >= num_nodes:
break
end = min((self._subgraph_idx + 1) * self._batch_size, num_nodes)
end = min((self._nflow_idx + 1) * self._batch_size, num_nodes)
seed_ids.append(utils.toindex(self._seed_nodes[start:end]))
self._subgraph_idx += 1
self._nflow_idx += 1
sgi = self._g._graph.neighbor_sampling(seed_ids, self._expand_factor,
self._num_hops, self._neighbor_type,
self._node_prob)
subgraphs = [DGLSubGraph(self._g, i.induced_nodes, i.induced_edges, \
i) for i in sgi]
self._subgraphs.extend(subgraphs)
nflows = [NodeFlow(self._g, i) for i in sgi]
self._nflows.extend(nflows)
if self._return_seed_id:
self._seed_ids.extend(seed_ids)

def __iter__(self):
return self

def __next__(self):
# If we don't have prefetched subgraphs, let's prefetch them.
if len(self._subgraphs) == 0:
# If we don't have prefetched NodeFlows, let's prefetch them.
if len(self._nflows) == 0:
self._prefetch()
# At this point, if we still don't have subgraphs, we must have
# iterate all subgraphs and we should stop the iterator now.
if len(self._subgraphs) == 0:
# At this point, if we still don't have NodeFlows, we must have
# iterated over all NodeFlows and we should stop the iterator now.
if len(self._nflows) == 0:
raise StopIteration
aux_infos = {}
if self._return_seed_id:
aux_infos['seeds'] = self._seed_ids.pop(0).tousertensor()
return self._subgraphs.pop(0), aux_infos
return self._nflows.pop(0), aux_infos

class _Prefetcher(object):
"""Internal shared prefetcher logic. It can be sub-classed by a Thread-based implementation
@@ -199,28 +198,28 @@ def NeighborSampler(g, batch_size, expand_factor, num_hops=1,
return_seed_id=False, prefetch=False):
'''Create a sampler that samples neighborhood.
This creates a subgraph data loader that samples subgraphs from the input graph
This creates a NodeFlow loader that samples subgraphs from the input graph
with neighbor sampling. This sampling method is implemented in C and can perform
sampling very efficiently.
A subgraph grows from a seed vertex. It contains sampled neighbors
A NodeFlow grows from a seed vertex. It contains sampled neighbors
of the seed vertex as well as the edges that connect neighbor nodes with
seed nodes. When the number of hops is k (>1), the neighbors are sampled
from the k-hop neighborhood. In this case, the sampled edges are the ones
that connect the source nodes and the sampled neighbor nodes of the source
nodes.
The subgraph loader returns a list of subgraphs and a dictionary of additional
information about the subgraphs. The size of the subgraph list is the number of workers.
The NodeFlow loader returns a list of NodeFlows and a dictionary of additional
information about the NodeFlows. The size of the NodeFlow list is the number of workers.
The dictionary contains:
- seeds: a list of 1D tensors of seed Ids, if return_seed_id is True.
Parameters
----------
g: the DGLGraph where we sample subgraphs.
batch_size: The number of subgraphs in a batch.
g: the DGLGraph where we sample NodeFlows.
batch_size: The number of NodeFlows in a batch.
expand_factor: the number of neighbors sampled from the neighbor list
of a vertex. The value of this parameter can be
an integer: indicates the number of neighbors sampled from a neighbor list.
@@ -234,20 +233,20 @@ def NeighborSampler(g, batch_size, expand_factor, num_hops=1,
node_prob: the probability that a neighbor node is sampled.
1D Tensor. None means uniform sampling. Otherwise, the number of elements
should be the same as the number of vertices in the graph.
seed_nodes: a list of nodes where we sample subgraphs from.
seed_nodes: a list of nodes where we sample NodeFlows from.
If it's None, the seed vertices are all vertices in the graph.
shuffle: indicates the sampled subgraphs are shuffled.
num_workers: the number of worker threads that sample subgraphs in parallel.
return_seed_id: indicates whether to return seed ids along with the subgraphs.
shuffle: indicates the sampled NodeFlows are shuffled.
num_workers: the number of worker threads that sample NodeFlows in parallel.
return_seed_id: indicates whether to return seed ids along with the NodeFlows.
The seed Ids are in the parent graph.
prefetch : bool, default False
Whether to prefetch the samples in the next batch.
Returns
-------
A subgraph iterator
The iterator returns a list of batched subgraphs and a dictionary of additional
information about the subgraphs.
A NodeFlow iterator
The iterator returns a list of batched NodeFlows and a dictionary of additional
information about the NodeFlows.
'''
loader = NSSubgraphLoader(g, batch_size, expand_factor, num_hops, neighbor_type, node_prob,
seed_nodes, shuffle, num_workers, return_seed_id)
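As the docstring above explains, expand_factor caps how many neighbors are drawn from each vertex's neighbor list and num_hops controls how many layers get sampled, so the size of each NodeFlow is bounded roughly as follows (a back-of-the-envelope sketch that ignores deduplication of repeated neighbors):

# Rough upper bound on the number of nodes in one sampled NodeFlow:
# each of the batch_size seeds contributes at most expand_factor**h nodes at hop h.
batch_size, expand_factor, num_hops = 32, 16, 2
max_nodes = sum(batch_size * expand_factor ** h for h in range(num_hops + 1))
print(max_nodes)   # 32 + 512 + 8192 = 8736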