diff --git a/docs/source/api/python/dgl.dataloading.rst b/docs/source/api/python/dgl.dataloading.rst
index 292f8f87911d..608110fd6635 100644
--- a/docs/source/api/python/dgl.dataloading.rst
+++ b/docs/source/api/python/dgl.dataloading.rst
@@ -19,6 +19,7 @@ and an ``EdgeDataLoader`` for edge/link prediction task.
.. autoclass:: GraphDataLoader
.. _api-dataloading-neighbor-sampling:
+
Neighbor Sampler
-----------------------------
.. currentmodule:: dgl.dataloading.neighbor
diff --git a/docs/source/api/python/dgl.rst b/docs/source/api/python/dgl.rst
index e02523eca3b4..b8ca7e266ae6 100644
--- a/docs/source/api/python/dgl.rst
+++ b/docs/source/api/python/dgl.rst
@@ -114,10 +114,33 @@ Utilities for computing adjacency matrix and Lapacian matrix.
khop_adj
laplacian_lambda_max
-Traversals
+Graph Traversal & Message Propagation
------------------------------------------
-Utilities for traversing graphs.
+DGL implements graph traversal algorithms as Python generators,
+which return the visited set of nodes or edges (as an ID tensor) at each iteration.
+The naming convention is ``_[nodes|edges]_generator``.
+An example usage is as follows.
+
+.. code:: python
+
+   g = ...  # some DGLGraph
+   for nodes in dgl.bfs_nodes_generator(g, 0):
+       do_something(nodes)
+
+.. autosummary::
+   :toctree: ../../generated/
+
+   bfs_nodes_generator
+   bfs_edges_generator
+   topological_nodes_generator
+   dfs_edges_generator
+   dfs_labeled_edges_generator
+
+DGL provides APIs to perform message passing following graph traversal order. ``prop_nodes_XXX``
+calls traversal algorithm ``XXX`` and triggers :func:`~DGLGraph.pull()` on the visited node
+set at each iteration. ``prop_edges_YYY`` applies traversal algorithm ``YYY`` and triggers
+:func:`~DGLGraph.send_and_recv()` on the visited edge set at each iteration.

.. autosummary::
   :toctree: ../../generated/

diff --git a/docs/source/api/python/graph.rst b/docs/source/api/python/graph.rst
deleted file mode 100644
index 4721b751fd37..000000000000
--- a/docs/source/api/python/graph.rst
+++ /dev/null
@@ -1,169 +0,0 @@
-.. _apigraph:
-
-dgl.DGLGraph
-=========================================
-
-.. currentmodule:: dgl
-.. autoclass:: DGLGraph
-
-Adding nodes and edges
-----------------------
-
-.. autosummary::
-   :toctree: ../../generated/
-
-   DGLGraph.add_nodes
-   DGLGraph.add_edge
-   DGLGraph.add_edges
-   DGLGraph.clear
-
-Querying graph structure
-------------------------
-
-.. autosummary::
-   :toctree: ../../generated/
-
-   DGLGraph.number_of_nodes
-   DGLGraph.number_of_edges
-   DGLGraph.__len__
-   DGLGraph.is_multigraph
-   DGLGraph.has_node
-   DGLGraph.has_nodes
-   DGLGraph.__contains__
-   DGLGraph.has_edge_between
-   DGLGraph.has_edges_between
-   DGLGraph.predecessors
-   DGLGraph.successors
-   DGLGraph.edge_id
-   DGLGraph.edge_ids
-   DGLGraph.find_edges
-   DGLGraph.in_edges
-   DGLGraph.out_edges
-   DGLGraph.all_edges
-   DGLGraph.in_degree
-   DGLGraph.in_degrees
-   DGLGraph.out_degree
-   DGLGraph.out_degrees
-
-Querying batch summary
-----------------------
-
-.. autosummary::
-   :toctree: ../../generated/
-
-   DGLGraph.batch_size
-   DGLGraph.batch_num_nodes
-   DGLGraph.batch_num_edges
-
-Querying sub-graph/parent-graph belonging information
------------------------------------------------------
-
-.. autosummary::
-   :toctree: ../../generated/
-
-   DGLGraph.parent
-
-Removing nodes and edges
-------------------------
-
-.. autosummary::
-   :toctree: ../../generated/
-
-   DGLGraph.remove_nodes
-   DGLGraph.remove_edges
-
-Transforming graph
-------------------
-
-..
autosummary:: - :toctree: ../../generated/ - - DGLGraph.subgraph - DGLGraph.subgraphs - DGLGraph.edge_subgraph - DGLGraph.line_graph - DGLGraph.reverse - DGLGraph.readonly - DGLGraph.flatten - DGLGraph.detach_parent - -Converting from/to other format -------------------------------- - -.. autosummary:: - :toctree: ../../generated/ - - DGLGraph.to_networkx - DGLGraph.from_networkx - DGLGraph.from_scipy_sparse_matrix - DGLGraph.adjacency_matrix - DGLGraph.adjacency_matrix_scipy - DGLGraph.incidence_matrix - -Using Node/edge features ------------------------- - -.. autosummary:: - :toctree: ../../generated/ - - DGLGraph.nodes - DGLGraph.edges - DGLGraph.ndata - DGLGraph.edata - DGLGraph.node_attr_schemes - DGLGraph.edge_attr_schemes - DGLGraph.set_n_initializer - DGLGraph.set_e_initializer - DGLGraph.local_var - DGLGraph.local_scope - -Computing with DGLGraph ------------------------ - -.. autosummary:: - :toctree: ../../generated/ - - DGLGraph.register_message_func - DGLGraph.register_reduce_func - DGLGraph.register_apply_node_func - DGLGraph.register_apply_edge_func - DGLGraph.apply_nodes - DGLGraph.apply_edges - DGLGraph.group_apply_edges - DGLGraph.send - DGLGraph.recv - DGLGraph.send_and_recv - DGLGraph.pull - DGLGraph.push - DGLGraph.update_all - DGLGraph.prop_nodes - DGLGraph.prop_edges - DGLGraph.filter_nodes - DGLGraph.filter_edges - DGLGraph.to - -Batch and Unbatch -------------------- - -.. autosummary:: - :toctree: ../../generated/ - - batch - unbatch - -Mapping between subgraph and parent graph ------------------------------------------ -.. autosummary:: - :toctree: ../../generated/ - - DGLGraph.parent_nid - DGLGraph.parent_eid - DGLGraph.map_to_subgraph_nid - -Synchronize features between subgraph and parent graph ------------------------------------------------------- -.. autosummary:: - :toctree: ../../generated/ - - DGLGraph.copy_from_parent - DGLGraph.copy_to_parent diff --git a/docs/source/api/python/graph_store.rst b/docs/source/api/python/graph_store.rst deleted file mode 100644 index 916e38bf7f9d..000000000000 --- a/docs/source/api/python/graph_store.rst +++ /dev/null @@ -1,47 +0,0 @@ -.. _apigraphstore: - -Graph Store -- Graph for multi-processing and distributed training -================================================================== - -.. currentmodule:: dgl.contrib.graph_store -.. autoclass:: SharedMemoryDGLGraph - -Querying the distributed setting --------------------------------- - -.. autosummary:: - :toctree: ../../generated/ - - SharedMemoryDGLGraph.num_workers - SharedMemoryDGLGraph.worker_id - SharedMemoryDGLGraph.destroy - -Using Node/edge features ------------------------- - -.. autosummary:: - :toctree: ../../generated/ - - SharedMemoryDGLGraph.init_ndata - SharedMemoryDGLGraph.init_edata - -Computing with Graph store --------------------------- - -.. autosummary:: - :toctree: ../../generated/ - - SharedMemoryDGLGraph.apply_nodes - SharedMemoryDGLGraph.apply_edges - SharedMemoryDGLGraph.group_apply_edges - SharedMemoryDGLGraph.recv - SharedMemoryDGLGraph.send_and_recv - SharedMemoryDGLGraph.pull - SharedMemoryDGLGraph.push - SharedMemoryDGLGraph.update_all - -Construct a graph store ------------------------ - -.. autofunction:: dgl.contrib.graph_store.create_graph_store_server -.. 
autofunction:: dgl.contrib.graph_store.create_graph_from_store diff --git a/docs/source/api/python/heterograph.rst b/docs/source/api/python/heterograph.rst deleted file mode 100644 index 86002d02bb08..000000000000 --- a/docs/source/api/python/heterograph.rst +++ /dev/null @@ -1,140 +0,0 @@ -.. _apiheterograph: - -dgl.DGLHeteroGraph -===================================================== - -.. currentmodule:: dgl -.. autoclass:: DGLHeteroGraph - -Conversion to and from heterogeneous graphs ------------------------------------------ - -.. automodule:: dgl.convert -.. currentmodule:: dgl - -.. autosummary:: - :toctree: ../../generated/ - - graph - bipartite - hetero_from_relations - heterograph - to_hetero - to_homo - to_networkx - DGLHeteroGraph.adjacency_matrix - DGLHeteroGraph.incidence_matrix - -Querying metagraph structure ----------------------------- - -.. autosummary:: - :toctree: ../../generated/ - - DGLHeteroGraph.ntypes - DGLHeteroGraph.etypes - DGLHeteroGraph.canonical_etypes - DGLHeteroGraph.metagraph - DGLHeteroGraph.to_canonical_etype - DGLHeteroGraph.get_ntype_id - DGLHeteroGraph.get_etype_id - -Querying graph structure ------------------------- - -.. autosummary:: - :toctree: ../../generated/ - - DGLHeteroGraph.number_of_nodes - DGLHeteroGraph.number_of_edges - DGLHeteroGraph.is_multigraph - DGLHeteroGraph.is_readonly - DGLHeteroGraph.has_node - DGLHeteroGraph.has_nodes - DGLHeteroGraph.has_edge_between - DGLHeteroGraph.has_edges_between - DGLHeteroGraph.predecessors - DGLHeteroGraph.successors - DGLHeteroGraph.edge_id - DGLHeteroGraph.edge_ids - DGLHeteroGraph.find_edges - DGLHeteroGraph.in_edges - DGLHeteroGraph.out_edges - DGLHeteroGraph.all_edges - DGLHeteroGraph.in_degree - DGLHeteroGraph.in_degrees - DGLHeteroGraph.out_degree - DGLHeteroGraph.out_degrees - -Querying and manipulating sparse format ---------------------------------------- - -.. autosummary:: - :toctree: ../../generated/ - - DGLHeteroGraph.format_in_use - DGLHeteroGraph.restrict_format - DGLHeteroGraph.to_format - -Querying and manipulating index data type ------------------------------------------ - -.. autosummary:: - :toctree: ../../generated/ - - DGLHeteroGraph.idtype - DGLHeteroGraph.long - DGLHeteroGraph.int - -Using Node/edge features ------------------------- - -.. autosummary:: - :toctree: ../../generated/ - - DGLHeteroGraph.nodes - DGLHeteroGraph.ndata - DGLHeteroGraph.edges - DGLHeteroGraph.edata - DGLHeteroGraph.node_attr_schemes - DGLHeteroGraph.edge_attr_schemes - DGLHeteroGraph.set_n_initializer - DGLHeteroGraph.set_e_initializer - DGLHeteroGraph.local_var - DGLHeteroGraph.local_scope - -Transforming graph ------------------- - -.. autosummary:: - :toctree: ../../generated/ - - DGLHeteroGraph.subgraph - DGLHeteroGraph.edge_subgraph - DGLHeteroGraph.node_type_subgraph - DGLHeteroGraph.edge_type_subgraph - -Computing with DGLHeteroGraph ------------------------------ - -.. 
autosummary:: - :toctree: ../../generated/ - - DGLHeteroGraph.apply_nodes - DGLHeteroGraph.apply_edges - DGLHeteroGraph.group_apply_edges - DGLHeteroGraph.send - DGLHeteroGraph.recv - DGLHeteroGraph.multi_recv - DGLHeteroGraph.send_and_recv - DGLHeteroGraph.multi_send_and_recv - DGLHeteroGraph.pull - DGLHeteroGraph.multi_pull - DGLHeteroGraph.push - DGLHeteroGraph.update_all - DGLHeteroGraph.multi_update_all - DGLHeteroGraph.prop_nodes - DGLHeteroGraph.prop_edges - DGLHeteroGraph.filter_nodes - DGLHeteroGraph.filter_edges - DGLHeteroGraph.to diff --git a/docs/source/api/python/init.rst b/docs/source/api/python/init.rst deleted file mode 100644 index ad2faa8b06a7..000000000000 --- a/docs/source/api/python/init.rst +++ /dev/null @@ -1,11 +0,0 @@ -.. _apiinit: - -Feature Initializer -=================== - -.. automodule:: dgl.init -.. autosummary:: - :toctree: ../../generated/ - - base_initializer - zero_initializer diff --git a/docs/source/api/python/nn.functional.rst b/docs/source/api/python/nn.functional.rst new file mode 100644 index 000000000000..7fcd8e8ee6b8 --- /dev/null +++ b/docs/source/api/python/nn.functional.rst @@ -0,0 +1,11 @@ +.. _apinn-functional: + +dgl.nn.functional +================= + +.. automodule:: dgl.nn.functional + +.. autosummary:: + :toctree: ../../generated/ + + edge_softmax diff --git a/docs/source/api/python/nn.rst b/docs/source/api/python/nn.rst index eccdea7d93fd..a0a77fce886d 100644 --- a/docs/source/api/python/nn.rst +++ b/docs/source/api/python/nn.rst @@ -10,20 +10,3 @@ dgl.nn nn.pytorch nn.mxnet nn.tensorflow - -dgl.nn.functional -================= - -Edge Softmax module -------------------- - -We also provide framework agnostic edge softmax module which was frequently used in -GNN-like structures, e.g. -`Graph Attention Network `_, -`Transformer `_, -`Capsule `_, etc. - -.. autosummary:: - :toctree: ../../generated/ - - functional.edge_softmax diff --git a/docs/source/api/python/nn.tensorflow.rst b/docs/source/api/python/nn.tensorflow.rst index 1b9aa2373282..b93ea21d3cdc 100644 --- a/docs/source/api/python/nn.tensorflow.rst +++ b/docs/source/api/python/nn.tensorflow.rst @@ -1,7 +1,7 @@ .. _apinn-tensorflow: NN Modules (Tensorflow) -==================== +==================================== .. _apinn-tensorflow-conv: diff --git a/docs/source/api/python/nodeflow.rst b/docs/source/api/python/nodeflow.rst deleted file mode 100644 index c38b93842433..000000000000 --- a/docs/source/api/python/nodeflow.rst +++ /dev/null @@ -1,83 +0,0 @@ -.. _apinodeflow: - -dgl.nodeflow (Deprecating) -============== - -.. warning:: - This module is going to be deprecated in favor of :ref:`api-sampling`. - -.. currentmodule:: dgl -.. autoclass:: NodeFlow - -Querying graph structure ------------------------- - -.. autosummary:: - :toctree: ../../generated/ - - NodeFlow.num_layers - NodeFlow.num_blocks - NodeFlow.layer_size - NodeFlow.block_size - NodeFlow.layer_in_degree - NodeFlow.layer_out_degree - NodeFlow.layer_nid - NodeFlow.layer_parent_nid - NodeFlow.block_eid - NodeFlow.block_parent_eid - NodeFlow.block_edges - -Converting to other format -------------------------------- - -.. autosummary:: - :toctree: ../../generated/ - - NodeFlow.block_adjacency_matrix - NodeFlow.block_incidence_matrix - -Using Node/edge features ------------------------- - -.. 
autosummary:: - :toctree: ../../generated/ - - NodeFlow.layers - NodeFlow.blocks - NodeFlow.set_n_initializer - NodeFlow.set_e_initializer - NodeFlow.node_attr_schemes - NodeFlow.edge_attr_schemes - -Mapping between NodeFlow and parent graph ------------------------------------------ -.. autosummary:: - :toctree: ../../generated/ - - NodeFlow.map_to_parent_nid - NodeFlow.map_to_parent_eid - NodeFlow.map_from_parent_nid - - -Synchronize features between NodeFlow and parent graph ------------------------------------------------------- -.. autosummary:: - :toctree: ../../generated/ - - NodeFlow.copy_from_parent - NodeFlow.copy_to_parent - -Computing with NodeFlow ------------------------ - -.. autosummary:: - :toctree: ../../generated/ - - NodeFlow.register_message_func - NodeFlow.register_reduce_func - NodeFlow.register_apply_node_func - NodeFlow.register_apply_edge_func - NodeFlow.apply_layer - NodeFlow.apply_block - NodeFlow.block_compute - NodeFlow.prop_flow diff --git a/docs/source/api/python/propagate.rst b/docs/source/api/python/propagate.rst deleted file mode 100644 index 1cc5c6045e58..000000000000 --- a/docs/source/api/python/propagate.rst +++ /dev/null @@ -1,18 +0,0 @@ -dgl.propagate -=============== - -.. automodule:: dgl.propagate - -Propagate messages and perform computation following graph traversal order. ``prop_nodes_XXX`` -calls traversal algorithm ``XXX`` and triggers :func:`~DGLGraph.pull()` on the visited node -set at each iteration. ``prop_edges_YYY`` applies traversal algorithm ``YYY`` and triggers -:func:`~DGLGraph.send_and_recv()` on the visited edge set at each iteration. - -.. autosummary:: - :toctree: ../../generated/ - - prop_nodes - prop_edges - prop_nodes_bfs - prop_nodes_topo - prop_edges_dfs diff --git a/docs/source/api/python/random.rst b/docs/source/api/python/random.rst deleted file mode 100644 index 7711b98a9f24..000000000000 --- a/docs/source/api/python/random.rst +++ /dev/null @@ -1,13 +0,0 @@ -.. _apirandom: - -dgl.random -==================================== - -.. automodule:: dgl.random - -Utilities used to control DGL's random number generator. - -.. autosummary:: - :toctree: ../../generated - - seed diff --git a/docs/source/api/python/readout.rst b/docs/source/api/python/readout.rst deleted file mode 100644 index 9998481fe429..000000000000 --- a/docs/source/api/python/readout.rst +++ /dev/null @@ -1,25 +0,0 @@ -.. _apibatch: - -dgl.readout -================================================== - -.. currentmodule:: dgl - -Graph Readout -------------- - -.. autosummary:: - :toctree: ../../generated/ - - sum_nodes - sum_edges - mean_nodes - mean_edges - max_nodes - max_edges - topk_nodes - topk_edges - softmax_nodes - softmax_edges - broadcast_nodes - broadcast_edges diff --git a/docs/source/api/python/sampler.rst b/docs/source/api/python/sampler.rst deleted file mode 100644 index 3dac4ac5c302..000000000000 --- a/docs/source/api/python/sampler.rst +++ /dev/null @@ -1,43 +0,0 @@ -.. apisampler - -dgl.contrib.sampling (Deprecating) -====================== - -.. warning:: - This module is going to be deprecated in favor of :ref:`api-sampling`. - -Module for sampling algorithms on graph. Each algorithm is implemented as a -data loader, which produces sampled subgraphs (called Nodeflow) at each -iteration. - -.. autofunction:: dgl.contrib.sampling.sampler.NeighborSampler -.. autofunction:: dgl.contrib.sampling.sampler.LayerSampler -.. autofunction:: dgl.contrib.sampling.sampler.EdgeSampler - -Distributed sampler ------------------------- - -.. 
currentmodule:: dgl.contrib.sampling.dis_sampler -.. autoclass:: SamplerPool - -.. autosummary:: - :toctree: ../../generated/ - - SamplerPool.start - SamplerPool.worker - -.. autoclass:: SamplerSender - -.. autosummary:: - :toctree: ../../generated/ - - SamplerSender.send - SamplerSender.signal - -.. autoclass:: SamplerReceiver - -.. autosummary:: - :toctree: ../../generated/ - - SamplerReceiver.__iter__ - SamplerReceiver.__next__ diff --git a/docs/source/api/python/traversal.rst b/docs/source/api/python/traversal.rst deleted file mode 100644 index a89f5550ad85..000000000000 --- a/docs/source/api/python/traversal.rst +++ /dev/null @@ -1,23 +0,0 @@ -dgl.traversal -=============== - -.. automodule:: dgl.traversal - -Graph traversal algorithms implemented as python generators, which returns the visited set -of nodes or edges at each iteration. The naming convention -is ``_[nodes|edges]_generator``. An example usage is as follows. - -.. code:: python - - g = ... # some DGLGraph - for nodes in dgl.bfs_nodes_generator(g, 0): - do_something(nodes) - -.. autosummary:: - :toctree: ../../generated/ - - bfs_nodes_generator - bfs_edges_generator - topological_nodes_generator - dfs_edges_generator - dfs_labeled_edges_generator diff --git a/docs/source/conf.py b/docs/source/conf.py index d0ee181dbe4e..2376fcc42a6c 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -195,14 +195,12 @@ # sphinx gallery configurations from sphinx_gallery.sorting import FileNameSortKey -examples_dirs = ['../../tutorials/basics', - '../../tutorials/models', - '../../new-tutorial/blitz', - '../../new-tutorial/large'] # path to find sources -gallery_dirs = ['tutorials/basics', - 'tutorials/models', - 'new-tutorial/blitz', - 'new-tutorial/large'] # path to generate docs +examples_dirs = ['../../tutorials/blitz', + '../../tutorials/large', + '../../tutorials/models'] # path to find sources +gallery_dirs = ['tutorials/blitz/', + 'tutorials/large/', + 'tutorials/models/'] # path to generate docs reference_url = { 'dgl' : None, 'numpy': 'http://docs.scipy.org/doc/numpy/', diff --git a/docs/source/guide/distributed-apis.rst b/docs/source/guide/distributed-apis.rst index 786e4e03d991..b6d83f404614 100644 --- a/docs/source/guide/distributed-apis.rst +++ b/docs/source/guide/distributed-apis.rst @@ -101,13 +101,11 @@ Users can also assign a new :class:`~dgl.distributed.DistTensor` to .. code:: python - g.ndata['train_mask'] - - g.ndata['train_mask'][0] - tensor([1], dtype=torch.uint8) + g.ndata['train_mask'] # + g.ndata['train_mask'][0] # tensor([1], dtype=torch.uint8) Distributed Tensor -~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~ As mentioned earlier, DGL shards node/edge features and stores them in a cluster of machines. DGL provides distributed tensors with a tensor-like interface to access the partitioned @@ -124,7 +122,7 @@ in the cluster even if the :class:`~dgl.distributed.DistTensor` object disappear .. code:: python - tensor = dgl.distributed.DistTensor((g.number_of_nodes(), 10), th.float32, name=’test’) + tensor = dgl.distributed.DistTensor((g.number_of_nodes(), 10), th.float32, name='test') **Note**: :class:`~dgl.distributed.DistTensor` creation is a synchronized operation. All trainers have to invoke the creation and the creation succeeds only when all trainers call it. 
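To make the synchronized creation above concrete, here is a minimal sketch of creating a
:class:`~dgl.distributed.DistTensor` and attaching it as node data. It assumes an
already-initialized distributed setup; the graph name ``'graph_name'`` and the feature
name ``'new_feat'`` are hypothetical placeholders.

.. code:: python

   import torch as th
   import dgl

   # Assumes dgl.distributed.initialize(...) has been called and that the graph
   # 'graph_name' (hypothetical) has been partitioned and loaded by the servers.
   g = dgl.distributed.DistGraph('graph_name')

   # Creation is a synchronized operation: every trainer must execute this
   # line, and it succeeds only once all trainers have called it.
   tensor = dgl.distributed.DistTensor((g.number_of_nodes(), 10), th.float32, name='test')

   # The distributed tensor can be attached as node data and sliced like an
   # ordinary tensor; reads and writes are served by the machines that own
   # each partition.
   g.ndata['new_feat'] = tensor
   print(g.ndata['new_feat'][0:3])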
diff --git a/docs/source/guide/distributed-preprocessing.rst b/docs/source/guide/distributed-preprocessing.rst index 6b9617faa3d6..677a13aece6b 100644 --- a/docs/source/guide/distributed-preprocessing.rst +++ b/docs/source/guide/distributed-preprocessing.rst @@ -67,7 +67,7 @@ The following example considers nodes inside the training set and outside the tr .. code:: python - dgl.distributed.partition_graph(g, ‘graph_name’, 4, ‘/tmp/test’, balance_ntypes=g.ndata[‘train_mask’]) + dgl.distributed.partition_graph(g, 'graph_name', 4, '/tmp/test', balance_ntypes=g.ndata['train_mask']) In addition to balancing the node types, :func:`dgl.distributed.partition_graph` also allows balancing between in-degrees of nodes of different node types by specifying ``balance_edges``. This balances diff --git a/docs/source/guide/index.rst b/docs/source/guide/index.rst index 789cb69a82a9..24f42e12e2fb 100644 --- a/docs/source/guide/index.rst +++ b/docs/source/guide/index.rst @@ -12,3 +12,4 @@ User Guide training minibatch distributed + mixed_precision diff --git a/docs/source/guide/preface.rst b/docs/source/guide/preface.rst deleted file mode 100644 index 4ff043cf83df..000000000000 --- a/docs/source/guide/preface.rst +++ /dev/null @@ -1,4 +0,0 @@ -Preface -======= - -Preface chapter diff --git a/docs/source/guide_cn/distributed-apis.rst b/docs/source/guide_cn/distributed-apis.rst index 3c950bf8b5f4..fb0594d718db 100644 --- a/docs/source/guide_cn/distributed-apis.rst +++ b/docs/source/guide_cn/distributed-apis.rst @@ -107,7 +107,7 @@ DGL为分布式张量提供了类似于单机普通张量的接口,以访问 .. code:: python - tensor = dgl.distributed.DistTensor((g.number_of_nodes(), 10), th.float32, name=’test’) + tensor = dgl.distributed.DistTensor((g.number_of_nodes(), 10), th.float32, name='test') **Note**: :class:`~dgl.distributed.DistTensor` 的创建是一个同步操作。所有训练器都必须调用创建, 并且只有当所有训练器都调用它时,此创建过程才能成功。 diff --git a/docs/source/guide_cn/distributed-preprocessing.rst b/docs/source/guide_cn/distributed-preprocessing.rst index 79f3ef6925a9..b142866bcb81 100644 --- a/docs/source/guide_cn/distributed-preprocessing.rst +++ b/docs/source/guide_cn/distributed-preprocessing.rst @@ -53,7 +53,7 @@ JSON文件包含所有划分的配置。如果该API没有为节点和边分配 .. code:: python - dgl.distributed.partition_graph(g, ‘graph_name’, 4, ‘/tmp/test’, balance_ntypes=g.ndata[‘train_mask’]) + dgl.distributed.partition_graph(g, 'graph_name', 4, '/tmp/test', balance_ntypes=g.ndata['train_mask']) 除了平衡节点的类型之外, :func:`dgl.distributed.partition_graph` 还允许通过指定 ``balance_edges`` 来平衡每个类型节点在子图中的入度。这平衡了不同类型节点的连边数量。 diff --git a/docs/source/guide_cn/graph.rst b/docs/source/guide_cn/graph.rst index b6045acecc21..62fe3cb1b753 100644 --- a/docs/source/guide_cn/graph.rst +++ b/docs/source/guide_cn/graph.rst @@ -27,9 +27,9 @@ DGL通过其核心数据结构 :class:`~dgl.DGLGraph` 提供了一个以图为 :hidden: :glob: - graph_cn-basic - graph_cn-graphs-nodes-edges - graph_cn-feature - graph_cn-external - graph_cn-heterogeneous - graph_cn-gpu + graph-basic + graph-graphs-nodes-edges + graph-feature + graph-external + graph-heterogeneous + graph-gpu diff --git a/docs/source/guide_cn/index.rst b/docs/source/guide_cn/index.rst index 8c64cece6e0c..045316d709b3 100644 --- a/docs/source/guide_cn/index.rst +++ b/docs/source/guide_cn/index.rst @@ -1,8 +1,6 @@ 用户指南 ========== -(持续更新中) - .. toctree:: :maxdepth: 2 :titlesonly: diff --git a/docs/source/guide_cn/message.rst b/docs/source/guide_cn/message.rst index b99bd6c2c313..88a2cbbf141d 100644 --- a/docs/source/guide_cn/message.rst +++ b/docs/source/guide_cn/message.rst @@ -1,14 +1,14 @@ .. 
_guide_cn-message-passing: 第2章:消息传递范式 -================ +=========================== :ref:`(English Version) ` 消息传递是实现GNN的一种通用框架和编程范式。它从聚合与更新的角度归纳总结了多种GNN模型的实现。 消息传递范式 ----------- +---------------------- 假设节点 :math:`v` 上的的特征为 :math:`x_v\in\mathbb{R}^{d_1}`,边 :math:`({u}, {v})` 上的特征为 :math:`w_{e}\in\mathbb{R}^{d_2}`。 **消息传递范式** 定义了以下逐节点和边上的计算: @@ -21,7 +21,7 @@ **聚合函数** :math:`\rho` 会聚合节点接受到的消息。 **更新函数** :math:`\psi` 会结合聚合后的消息和节点本身的特征来更新节点的特征。 本章路线图 --------- +-------------------- 本章首先介绍了DGL的消息传递API。然后讲解了如何高效地在点和边上使用这些API。本章的最后一节解释了如何在异构图上实现消息传递。 diff --git a/docs/source/index.rst b/docs/source/index.rst index 93d6dd08f59d..6a04a989b6c8 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -3,75 +3,8 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -Overview of DGL -=============== - -Deep Graph Library (DGL) is a Python package built for easy implementation of -graph neural network model family, on top of existing DL frameworks (e.g. -PyTorch, MXNet, Gluon etc.). - -DGL reduces the implementation of graph neural networks into declaring a set -of *functions* (or *modules* in PyTorch terminology). In addition, DGL -provides: - -* Versatile controls over message passing, ranging from low-level operations - such as sending along selected edges and receiving on specific nodes, to - high-level control such as graph-wide feature updates. -* Transparent speed optimization with automatic batching of computations and - sparse matrix multiplication. -* Seamless integration with existing deep learning frameworks. -* Easy and friendly interfaces for node/edge feature access and graph - structure manipulation. -* Good scalability to graphs with tens of millions of vertices. - -To begin with, we have prototyped 10 models across various domains: -semi-supervised learning on graphs (with potentially billions of nodes/edges), -generative models on graphs, (previously) difficult-to-parallelize tree-based -models like TreeLSTM, etc. We also implement some conventional models in DGL -from a new graphical perspective yielding simplicity. - -Getting Started ---------------- - -* :doc:`Installation`. -* :doc:`Quickstart tutorial` for absolute beginners. -* :doc:`User guide`. -* :doc:`用户指南(User guide)中文版`. -* :doc:`API reference manual`. -* :doc:`End-to-end model tutorials` for learning DGL by popular models on graphs. - -.. - Follow the :doc:`instructions` to install DGL. - :doc:`` is the most common place to get started with. - It offers a broad experience of using DGL for deep learning on graph data. - - API reference document lists more endetailed specifications of each API and GNN modules, - a useful manual for in-depth developers. - - You can learn other basic concepts of DGL through the dedicated tutorials. - - * Learn constructing, saving and loading graphs with node and edge features :doc:`here`. - * Learn performing computation on graph using message passing :doc:`here`. - * Learn link prediction with DGL :doc:`here`. - * Learn graph classification with DGL :doc:`here`. - * Learn creating your own dataset for DGL :doc:`here`. - * Learn working with heterogeneous graph data :doc:`here`. - - End-to-end model tutorials are other good starting points for learning DGL and popular - models on graphs. The model tutorials are categorized based on the way they utilize DGL APIs. - - * :ref:`Graph Neural Network and its variant `: Learn how to use DGL to train - popular **GNN models** on one input graph. 
- * :ref:`Dealing with many small graphs `: Learn how to train models for
- many graph samples such as sentence parse trees.
- * :ref:`Generative models `: Learn how to deal with **dynamically-changing graphs**.
- * :ref:`Old (new) wines in new bottle `: Learn how to combine DGL with tensor-based
- DGL framework in a flexible way. Explore new perspective on traditional models by graphs.
- * :ref:`Training on giant graphs `: Learn how to train graph neural networks
- on giant graphs.
-
- Each tutorial is accompanied with a runnable python script and jupyter notebook that
- can be downloaded. If you would like the tutorials improved, please raise a github issue.
+Welcome to Deep Graph Library Tutorials and Documentation
+=========================================================

.. toctree::
   :maxdepth: 1
@@ -80,33 +13,19 @@ Getting Started
   :glob:

   install/index
-   install/backend
+   tutorials/blitz/index

.. toctree::
   :maxdepth: 2
-   :caption: Tutorials
-   :hidden:
-   :glob:
-
-   new-tutorial/blitz/index
-   new-tutorial/large/index
-
-.. toctree::
-   :maxdepth: 3
-   :caption: Model Examples
-   :hidden:
-   :glob:
-
-   tutorials/models/index
-
-.. toctree::
-   :maxdepth: 2
-   :caption: User Guide
+   :caption: Advanced Materials
   :hidden:
   :titlesonly:
   :glob:

   guide/index
+   guide_cn/index
+   tutorials/large/index
+   tutorials/models/index

.. toctree::
   :maxdepth: 2
@@ -121,6 +40,7 @@ Getting Started
   api/python/dgl.distributed
   api/python/dgl.function
   api/python/nn
+   api/python/nn.functional
   api/python/dgl.ops
   api/python/dgl.optim
   api/python/dgl.sampling
@@ -145,34 +65,39 @@ Getting Started
   env_var
   resources

-Relationship of DGL to other frameworks
----------------------------------------
-DGL is designed to be compatible and agnostic to the existing tensor
-frameworks. It provides a backend adapter interface that allows easy porting
-to other tensor-based, autograd-enabled frameworks.
+Deep Graph Library (DGL) is a Python package built for easy implementation of
+the graph neural network model family, on top of existing DL frameworks (currently
+supporting PyTorch, MXNet and TensorFlow). It offers versatile control of message passing,
+speed optimization via auto-batching and highly tuned sparse matrix kernels,
+and multi-GPU/CPU training to scale to graphs of hundreds of millions of
+nodes and edges.

-Free software
+Getting Started
+---------------
+
+For absolute beginners, start with the :doc:`Blitz Introduction to DGL `.
+It covers the basic concepts of common graph machine learning tasks and gives a step-by-step
+walkthrough of building Graph Neural Networks (GNNs) to solve them.
+
+For experienced users who wish to learn more advanced usage,
+
+* `Learn DGL by examples `_.
+* Read the :doc:`User Guide` (:doc:`中文版链接`), which explains the concepts
+  and usage of DGL in much more detail.
+* Go through the tutorials for :doc:`Stochastic Training of GNNs `,
+  which cover the basic steps for training GNNs on large graphs in mini-batches.
+* :doc:`Study classical papers ` on graph machine learning alongside DGL.
+* Search for the usage of a specific API in the :doc:`API reference manual `,
+  which organizes all DGL APIs by their namespace.
+
+Contribution
-------------
DGL is free software; you can redistribute it and/or modify it under the terms
of the Apache License 2.0. We welcome contributions.
Join us on `GitHub `_ and check out our
:doc:`contribution guidelines `.

-History
--------
-Prototype of DGL started in early Spring, 2018, at NYU Shanghai by Prof. `Zheng
-Zhang `_ and
-Quan Gan.
Serious development began when `Minjie
-`_, `Lingfan `_
-and Prof. `Jinyang Li `_ from NYU's
-system group joined, flanked by a team of student volunteers at NYU Shanghai,
-Fudan and other universities (Yu, Zihao, Murphy, Allen, Qipeng, Qi, Hao), as
-well as early adopters at the CILVR lab (Jake Zhao). Development accelerated
-when AWS MXNet Science team joined force, with Da Zheng, Alex Smola, Haibin
-Lin, Chao Ma and a number of others. For full credit, see `here
-`_.
-
Index
-----

* :ref:`genindex`
diff --git a/docs/source/install/backend.rst b/docs/source/install/backend.rst
deleted file mode 100644
index 1575f1c60f30..000000000000
--- a/docs/source/install/backend.rst
+++ /dev/null
@@ -1,48 +0,0 @@
-.. _backends:
-
-Working with different backends
-===============================
-
-DGL supports PyTorch, MXNet and Tensorflow backends.
-DGL will choose the backend on the following options (high priority to low priority)
-- `DGLBACKEND` environment
-  - You can use `DGLBACKEND=[BACKEND] python gcn.py ...` to specify the backend
-  - Or `export DGLBACKEND=[BACKEND]` to set the global environment variable
-- `config.json` file under "~/.dgl"
-  - You can use `python -m dgl.backend.set_default_backend [BACKEND]` to set the default backend
-
-Currently BACKEND can be chosen from mxnet, pytorch, tensorflow.
-
-PyTorch backend
----------------
-
-Export ``DGLBACKEND`` as ``pytorch`` to specify PyTorch backend. The required PyTorch
-version is 1.5.0 or later. See `pytorch.org `_ for installation instructions.
-
-MXNet backend
--------------
-
-Export ``DGLBACKEND`` as ``mxnet`` to specify MXNet backend. The required MXNet version is
-1.5 or later. See `mxnet.apache.org `_ for installation
-instructions.
-
-MXNet uses uint32 as the default data type for integer tensors, which only supports graph of
-size smaller than 2^32. To enable large graph training, *build* MXNet with ``USE_INT64_TENSOR_SIZE=1``
-flag. See `this FAQ `_ for more information.
-
-MXNet 1.5 and later has an option to enable Numpy shape mode for ``NDArray`` objects, some DGL models
-need this mode to be enabled to run correctly. However, this mode may not compatible with pretrained
-model parameters with this mode disabled, e.g. pretrained models from GluonCV and GluonNLP.
-By setting ``DGL_MXNET_SET_NP_SHAPE``, users can switch this mode on or off.
-
-Tensorflow backend
-------------------
-
-Export ``DGLBACKEND`` as ``tensorflow`` to specify Tensorflow backend. The required Tensorflow
-version is 2.2.0 or later. See `tensorflow.org `_ for installation
-instructions. In addition, DGL will set ``TF_FORCE_GPU_ALLOW_GROWTH`` to ``true`` to prevent Tensorflow take over the whole GPU memory:
-
-.. code:: bash
-
-   pip install "tensorflow>=2.2.0"  # when using tensorflow cpu version
-
diff --git a/docs/source/install/index.rst b/docs/source/install/index.rst
index d3e27bed4e81..f426ae0d0006 100644
--- a/docs/source/install/index.rst
+++ b/docs/source/install/index.rst
@@ -1,7 +1,5 @@
-Install DGL
-===========
-
-This topic explains how to install DGL. We recommend installing DGL by using ``conda`` or ``pip``.
+Install and Setup
+=================

System requirements
-------------------
@@ -22,7 +20,8 @@ CPU build, then the CPU build is overwritten.
Install from Conda or Pip
-------------------------

-Check out the `Get Started page `_.
+We recommend installing DGL with ``conda`` or ``pip``.
+Check out the instructions on the `Get Started page `_.

.. _install-from-source:

@@ -63,20 +62,19 @@ configuration as you wish.
For example, change ``USE_CUDA`` to ``ON`` will enable a CUDA build. You could also
pass ``-DKEY=VALUE`` to the cmake command for the same purpose.

-- CPU-only build
-  .. code:: bash
+* CPU-only build::
+
+   mkdir build
+   cd build
+   cmake ..
+   make -j4

-     mkdir build
-     cd build
-     cmake ..
-     make -j4
-- CUDA build
-  .. code:: bash
+* CUDA build::

-     mkdir build
-     cd build
-     cmake -DUSE_CUDA=ON ..
-     make -j4
+   mkdir build
+   cd build
+   cmake -DUSE_CUDA=ON ..
+   make -j4

Finally, install the Python binding.

@@ -125,8 +123,7 @@ You can build DGL with MSBuild. With `MS Build Tools
`_ installed, run the following
in VS2019 x64 Native tools command prompt.

-- CPU only build
-  .. code::
+* CPU only build::

     MD build
     CD build
@@ -134,8 +131,8 @@ in VS2019 x64 Native tools command prompt.
     msbuild dgl.sln /m
     CD ..\python
     python setup.py install
-- CUDA build
-  .. code::
+
+* CUDA build::

     MD build
     CD build
@@ -144,9 +141,61 @@ in VS2019 x64 Native tools command prompt.
     CD ..\python
     python setup.py install

-Optional Flags
-``````````````
+Compilation Flags
+`````````````````
+
+See `config.cmake `_.
+
+
+.. _backends:
+
+Working with different backends
+-------------------------------
+
+DGL supports PyTorch, MXNet and Tensorflow backends.
+DGL chooses the backend based on the following options (from high priority to low priority):
+
+* Use the ``DGLBACKEND`` environment variable:
+
+  - You can use ``DGLBACKEND=[BACKEND] python gcn.py ...`` to specify the backend
+  - Or ``export DGLBACKEND=[BACKEND]`` to set the global environment variable
+
+* Modify the ``config.json`` file under "~/.dgl":
+
+  - You can use ``python -m dgl.backend.set_default_backend [BACKEND]`` to set the default backend
+
+Currently ``BACKEND`` can be one of ``pytorch``, ``mxnet``, or ``tensorflow``.
+
+PyTorch backend
+```````````````
+
+Export ``DGLBACKEND`` as ``pytorch`` to specify PyTorch backend. The required PyTorch
+version is 1.5.0 or later. See `pytorch.org `_ for installation instructions.
+
+MXNet backend
+`````````````
+
+Export ``DGLBACKEND`` as ``mxnet`` to specify MXNet backend. The required MXNet version is
+1.5 or later. See `mxnet.apache.org `_ for installation
+instructions.
+
+MXNet uses uint32 as the default data type for integer tensors, which only supports graphs
+with fewer than 2^32 nodes or edges. To enable large graph training, *build* MXNet with ``USE_INT64_TENSOR_SIZE=1``
+flag. See `this FAQ `_ for more information.
+
+MXNet 1.5 and later has an option to enable NumPy shape mode for ``NDArray`` objects; some DGL models
+need this mode to be enabled to run correctly. However, this mode may not be compatible with model
+parameters pretrained with the mode disabled, e.g. pretrained models from GluonCV and GluonNLP.
+By setting ``DGL_MXNET_SET_NP_SHAPE``, users can switch this mode on or off.
+
+Tensorflow backend
+``````````````````
+
+Export ``DGLBACKEND`` as ``tensorflow`` to specify Tensorflow backend. The required Tensorflow
+version is 2.2.0 or later. See `tensorflow.org `_ for installation
+instructions. In addition, DGL will set ``TF_FORCE_GPU_ALLOW_GROWTH`` to ``true`` to prevent TensorFlow from taking over the whole GPU memory:

+.. code:: bash
+
+   pip install "tensorflow>=2.2.0"  # when using tensorflow cpu version

-- If you are using PyTorch, you can add ``-DBUILD_TORCH=ON`` flag in CMake
-  to build PyTorch plugins for further performance optimization. This applies for Linux,
-  Windows, and Mac.
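As a quick sanity check of the backend-selection rules above, one can set the environment
variable from Python before the first ``import dgl`` and confirm which backend was loaded.
This is a sketch; it assumes ``dgl.backend.backend_name`` reports the active backend in
your DGL version.

.. code:: python

   import os

   # The variable must be set before `dgl` is imported for the first time;
   # otherwise DGL falls back to the default recorded in ~/.dgl/config.json.
   os.environ['DGLBACKEND'] = 'pytorch'

   import dgl
   import dgl.backend as F

   print(F.backend_name)  # expected to print 'pytorch'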
diff --git a/new-tutorial/README.txt b/new-tutorial/README.txt
deleted file mode 100644
index e69de29bb2d1..000000000000
diff --git a/python/dgl/distributed/partition.py b/python/dgl/distributed/partition.py
index 986ef0abc178..18e0a3346c5b 100644
--- a/python/dgl/distributed/partition.py
+++ b/python/dgl/distributed/partition.py
@@ -333,7 +333,7 @@ def partition_graph(g, graph_name, num_parts, out_path, num_hops=1, part_method=
    * "inner_node" indicates whether a node belongs to a partition.
    * "inner_edge" indicates whether an edge belongs to a partition.
    * "orig_id" exists when reshuffle=True. It indicates the original node IDs in the original
-    graph before reshuffling.
+      graph before reshuffling.

    Node and edge features are splitted and stored together with each graph partition.
    All node/edge features in a partition are stored in a file with DGL format. The node/edge
diff --git a/python/dgl/heterograph.py b/python/dgl/heterograph.py
index 833e9a1e9b86..3e003bab2d3d 100644
--- a/python/dgl/heterograph.py
+++ b/python/dgl/heterograph.py
@@ -5420,7 +5420,7 @@ def formats(self, formats=None):

        * If formats is None, return the usage status of sparse formats
        * Otherwise, it can be ``'coo'``/``'csr'``/``'csc'`` or a sublist of
-        them, specifying the sparse formats to use.
+          them, specifying the sparse formats to use.

        Returns
        -------
diff --git a/python/dgl/ops/edge_softmax.py b/python/dgl/ops/edge_softmax.py
index 635bf6147b1e..eb871edaa73b 100644
--- a/python/dgl/ops/edge_softmax.py
+++ b/python/dgl/ops/edge_softmax.py
@@ -9,8 +9,6 @@ def edge_softmax(graph, logits, eids=ALL, norm_by='dst'):
    r"""Compute softmax over weights of incoming edges for every node.

-    Description
-    -----------
    For a node :math:`i`, edge softmax is an operation that computes

    .. math::
@@ -28,6 +26,9 @@ def edge_softmax(graph, logits, eids=ALL, norm_by='dst'):
    An example of using edge softmax is in
    `Graph Attention Network `__ where the attention
    weights are computed with this operation.
+    Other non-GNN examples that use this operation include
+    `Transformer `__,
+    `Capsule `__, etc.

    Parameters
    ----------
diff --git a/python/dgl/ops/sddmm.py b/python/dgl/ops/sddmm.py
index da6df6a0318b..ad13be9c5670 100644
--- a/python/dgl/ops/sddmm.py
+++ b/python/dgl/ops/sddmm.py
@@ -13,6 +13,7 @@ def gsddmm(g, op, lhs_data, rhs_data, lhs_target='u', rhs_target='v'):
    It computes edge features by :attr:`op` lhs features and rhs features.

    .. math::
+
        x_{e} = \phi(x_{lhs}, x_{rhs}), \forall (u,e,v)\in \mathcal{G}

    where :math:`x_{e}` is the returned feature on edges and :math:`x_u`,
@@ -33,9 +34,9 @@ def gsddmm(g, op, lhs_data, rhs_data, lhs_target='u', rhs_target='v'):
    rhs_data : tensor or None
        The right operand, could be None if it's not required by op.
    lhs_target: str
-        Choice of `u`(source), `e`(edge) or `v`(destination) for left operand.
+        Choice of ``u`` (source), ``e`` (edge) or ``v`` (destination) for the left operand.
    rhs_target: str
-        Choice of `u`(source), `e`(edge) or `v`(destination) for right operand.
+        Choice of ``u`` (source), ``e`` (edge) or ``v`` (destination) for the right operand.

    Returns
    -------
diff --git a/tests/scripts/task_pytorch_tutorial_test.sh b/tests/scripts/task_pytorch_tutorial_test.sh
index 76ff2ee6752b..171812e298c4 100644
--- a/tests/scripts/task_pytorch_tutorial_test.sh
+++ b/tests/scripts/task_pytorch_tutorial_test.sh
@@ -4,7 +4,6 @@
.
/opt/conda/etc/profile.d/conda.sh conda activate pytorch-ci TUTORIAL_ROOT="./tutorials" -NEW_TUTORIAL_ROOT="./new-tutorial" function fail { echo FAIL: $@ @@ -29,11 +28,3 @@ do done popd > /dev/null - -pushd ${NEW_TUTORIAL_ROOT} > /dev/null -for f in $(find . -name "*.py" ! -name "*_mx.py") -do - echo "Running tutorial ${f} ..." - python3 $f || fail "run ${f}" -done -popd > /dev/null diff --git a/tutorials/README.txt b/tutorials/README.txt deleted file mode 100644 index e69de29bb2d1..000000000000 diff --git a/tutorials/basics/1_first.py b/tutorials/basics/1_first.py deleted file mode 100644 index 8e0199a0afae..000000000000 --- a/tutorials/basics/1_first.py +++ /dev/null @@ -1,254 +0,0 @@ -""" -.. currentmodule:: dgl - -DGL at a Glance -========================= - -**Author**: `Minjie Wang `_, Quan Gan, `Jake -Zhao `_, Zheng Zhang - -DGL is a Python package dedicated to deep learning on graphs, built atop -existing tensor DL frameworks (e.g. Pytorch, MXNet) and simplifying the -implementation of graph-based neural networks. - -The goal of this tutorial: - -- Understand how DGL enables computation on graph from a high level. -- Train a simple graph neural network in DGL to classify nodes in a graph. - -At the end of this tutorial, we hope you get a brief feeling of how DGL works. - -*This tutorial assumes basic familiarity with pytorch.* -""" - -############################################################################### -# Tutorial problem description -# ---------------------------- -# -# The tutorial is based on the "Zachary's karate club" problem. The karate club -# is a social network that includes 34 members and documents pairwise links -# between members who interact outside the club. The club later divides into -# two communities led by the instructor (node 0) and the club president (node -# 33). The network is visualized as follows with the color indicating the -# community: -# -# .. image:: https://data.dgl.ai/tutorial/img/karate-club.png -# :align: center -# -# The task is to predict which side (0 or 33) each member tends to join given -# the social network itself. - - -############################################################################### -# Step 1: Creating a graph in DGL -# ------------------------------- -# Create the graph for Zachary's karate club as follows: - -import dgl -import numpy as np - -def build_karate_club_graph(): - # All 78 edges are stored in two numpy arrays. One for source endpoints - # while the other for destination endpoints. - src = np.array([1, 2, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 9, 10, 10, - 10, 11, 12, 12, 13, 13, 13, 13, 16, 16, 17, 17, 19, 19, 21, 21, - 25, 25, 27, 27, 27, 28, 29, 29, 30, 30, 31, 31, 31, 31, 32, 32, - 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, - 33, 33, 33, 33, 33, 33, 33, 33, 33, 33]) - dst = np.array([0, 0, 1, 0, 1, 2, 0, 0, 0, 4, 5, 0, 1, 2, 3, 0, 2, 2, 0, 4, - 5, 0, 0, 3, 0, 1, 2, 3, 5, 6, 0, 1, 0, 1, 0, 1, 23, 24, 2, 23, - 24, 2, 23, 26, 1, 8, 0, 24, 25, 28, 2, 8, 14, 15, 18, 20, 22, 23, - 29, 30, 31, 8, 9, 13, 14, 15, 18, 19, 20, 22, 23, 26, 27, 28, 29, 30, - 31, 32]) - # Edges are directional in DGL; Make them bi-directional. - u = np.concatenate([src, dst]) - v = np.concatenate([dst, src]) - # Construct a DGLGraph - return dgl.graph((u, v)) - -############################################################################### -# Print out the number of nodes and edges in our newly constructed graph: - -G = build_karate_club_graph() -print('We have %d nodes.' 
% G.number_of_nodes()) -print('We have %d edges.' % G.number_of_edges()) - -############################################################################### -# Visualize the graph by converting it to a `networkx -# `_ graph: - -import networkx as nx -# Since the actual graph is undirected, we convert it for visualization -# purpose. -nx_G = G.to_networkx().to_undirected() -# Kamada-Kawaii layout usually looks pretty for arbitrary graphs -pos = nx.kamada_kawai_layout(nx_G) -nx.draw(nx_G, pos, with_labels=True, node_color=[[.7, .7, .7]]) - -############################################################################### -# Step 2: Assign features to nodes or edges -# -------------------------------------------- -# Graph neural networks associate features with nodes and edges for training. -# For our classification example, since there is no input feature, we assign each node -# with a learnable embedding vector. - -# In DGL, you can add features for all nodes at once, using a feature tensor that -# batches node features along the first dimension. The code below adds the learnable -# embeddings for all nodes: - -import torch -import torch.nn as nn -import torch.nn.functional as F - -embed = nn.Embedding(34, 5) # 34 nodes with embedding dim equal to 5 -G.ndata['feat'] = embed.weight - -############################################################################### -# Print out the node features to verify: - -# print out node 2's input feature -print(G.ndata['feat'][2]) - -# print out node 10 and 11's input features -print(G.ndata['feat'][[10, 11]]) - -############################################################################### -# Step 3: Define a Graph Convolutional Network (GCN) -# -------------------------------------------------- -# To perform node classification, use the Graph Convolutional Network -# (GCN) developed by `Kipf and Welling `_. Here -# is the simplest definition of a GCN framework. We recommend that you -# read the original paper for more details. -# -# - At layer :math:`l`, each node :math:`v_i^l` carries a feature vector :math:`h_i^l`. -# - Each layer of the GCN tries to aggregate the features from :math:`u_i^{l}` where -# :math:`u_i`'s are neighborhood nodes to :math:`v` into the next layer representation at -# :math:`v_i^{l+1}`. This is followed by an affine transformation with some -# non-linearity. -# -# The above definition of GCN fits into a **message-passing** paradigm: Each -# node will update its own feature with information sent from neighboring -# nodes. A graphical demonstration is displayed below. -# -# .. image:: https://data.dgl.ai/tutorial/1_first/mailbox.png -# :alt: mailbox -# :align: center -# -# In DGL, we provide implementations of popular Graph Neural Network layers under -# the `dgl..nn` subpackage. The :class:`~dgl.nn.pytorch.GraphConv` module -# implements one Graph Convolutional layer. - -from dgl.nn.pytorch import GraphConv - -############################################################################### -# Define a deeper GCN model that contains two GCN layers: - -class GCN(nn.Module): - def __init__(self, in_feats, hidden_size, num_classes): - super(GCN, self).__init__() - self.conv1 = GraphConv(in_feats, hidden_size) - self.conv2 = GraphConv(hidden_size, num_classes) - - def forward(self, g, inputs): - h = self.conv1(g, inputs) - h = torch.relu(h) - h = self.conv2(g, h) - return h - -# The first layer transforms input features of size of 5 to a hidden size of 5. 
-# The second layer transforms the hidden layer and produces output features of -# size 2, corresponding to the two groups of the karate club. -net = GCN(5, 5, 2) - -############################################################################### -# Step 4: Data preparation and initialization -# ------------------------------------------- -# -# We use learnable embeddings to initialize the node features. Since this is a -# semi-supervised setting, only the instructor (node 0) and the club president -# (node 33) are assigned labels. The implementation is available as follow. - -inputs = embed.weight -labeled_nodes = torch.tensor([0, 33]) # only the instructor and the president nodes are labeled -labels = torch.tensor([0, 1]) # their labels are different - -############################################################################### -# Step 5: Train then visualize -# ---------------------------- -# The training loop is exactly the same as other PyTorch models. -# We (1) create an optimizer, (2) feed the inputs to the model, -# (3) calculate the loss and (4) use autograd to optimize the model. -import itertools - -optimizer = torch.optim.Adam(itertools.chain(net.parameters(), embed.parameters()), lr=0.01) -all_logits = [] -for epoch in range(50): - logits = net(G, inputs) - # we save the logits for visualization later - all_logits.append(logits.detach()) - logp = F.log_softmax(logits, 1) - # we only compute loss for labeled nodes - loss = F.nll_loss(logp[labeled_nodes], labels) - - optimizer.zero_grad() - loss.backward() - optimizer.step() - - print('Epoch %d | Loss: %.4f' % (epoch, loss.item())) - -############################################################################### -# This is a rather toy example, so it does not even have a validation or test -# set. Instead, Since the model produces an output feature of size 2 for each node, we can -# visualize by plotting the output feature in a 2D space. -# The following code animates the training process from initial guess -# (where the nodes are not classified correctly at all) to the end -# (where the nodes are linearly separable). - -import matplotlib.animation as animation -import matplotlib.pyplot as plt - -def draw(i): - cls1color = '#00FFFF' - cls2color = '#FF00FF' - pos = {} - colors = [] - for v in range(34): - pos[v] = all_logits[i][v].numpy() - cls = pos[v].argmax() - colors.append(cls1color if cls else cls2color) - ax.cla() - ax.axis('off') - ax.set_title('Epoch: %d' % i) - nx.draw_networkx(nx_G.to_undirected(), pos, node_color=colors, - with_labels=True, node_size=300, ax=ax) - -fig = plt.figure(dpi=150) -fig.clf() -ax = fig.subplots() -draw(0) # draw the prediction of the first epoch -plt.close() - -############################################################################### -# .. image:: https://data.dgl.ai/tutorial/1_first/karate0.png -# :height: 300px -# :width: 400px -# :align: center - -############################################################################### -# The following animation shows how the model correctly predicts the community -# after a series of training epochs. - -ani = animation.FuncAnimation(fig, draw, frames=len(all_logits), interval=200) - -############################################################################### -# .. 
image:: https://data.dgl.ai/tutorial/1_first/karate.gif -# :height: 300px -# :width: 400px -# :align: center - -############################################################################### -# Next steps -# ---------- -# -# In the :doc:`next tutorial <2_basics>`, we will go through some more basics -# of DGL, such as reading and writing node/edge features. diff --git a/tutorials/basics/2_basics.py b/tutorials/basics/2_basics.py deleted file mode 100644 index 6aa6790c0b88..000000000000 --- a/tutorials/basics/2_basics.py +++ /dev/null @@ -1,194 +0,0 @@ -""" -.. currentmodule:: dgl - -DGLGraph and Node/edge Features -=============================== - -**Author**: `Minjie Wang `_, Quan Gan, Yu Gai, -Zheng Zhang - -In this tutorial, you learn how to create a graph and how to read and write node and edge representations. -""" - -############################################################################### -# Creating a graph -# ---------------- -# The design of :class:`DGLGraph` was influenced by other graph libraries. You -# can create a graph from networkx and convert it into a :class:`DGLGraph` and -# vice versa. - -import networkx as nx -import dgl - -g_nx = nx.petersen_graph() -g_dgl = dgl.DGLGraph(g_nx) - -import matplotlib.pyplot as plt -plt.subplot(121) -nx.draw(g_nx, with_labels=True) -plt.subplot(122) -nx.draw(g_dgl.to_networkx(), with_labels=True) - -plt.show() - - -############################################################################### -# There are many ways to construct a :class:`DGLGraph`. Below are the allowed -# data types ordered by our recommendataion. -# -# * A pair of arrays ``(u, v)`` storing the source and destination nodes respectively. -# They can be numpy arrays or tensor objects from the backend framework. -# * ``scipy`` sparse matrix representing the adjacency matrix of the graph to be -# constructed. -# * ``networkx`` graph object. -# * A list of edges in the form of integer pairs. -# -# The examples below construct the same star graph via different methods. -# -# :class:`DGLGraph` nodes are a consecutive range of integers between 0 and -# :func:`number_of_nodes() `. -# :class:`DGLGraph` edges are in order of their additions. Note that -# edges are accessed in much the same way as nodes, with one extra feature: -# *edge broadcasting*. - -import torch as th -import numpy as np -import scipy.sparse as spp - -# Create a star graph from a pair of arrays (using ``numpy.array`` works too). -u = th.tensor([0, 0, 0, 0, 0]) -v = th.tensor([1, 2, 3, 4, 5]) -star1 = dgl.DGLGraph((u, v)) - -# Create the same graph from a scipy sparse matrix (using ``scipy.sparse.csr_matrix`` works too). -adj = spp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy()))) -star3 = dgl.DGLGraph(adj) - -############################################################################### -# You can also create a graph by progressively adding more nodes and edges. -# Although it is not as efficient as the above constructors, it is suitable -# for applications where the graph cannot be constructed in one shot. - -g = dgl.DGLGraph() -g.add_nodes(10) -# A couple edges one-by-one -for i in range(1, 4): - g.add_edge(i, 0) -# A few more with a paired list -src = list(range(5, 8)); dst = [0]*3 -g.add_edges(src, dst) -# finish with a pair of tensors -src = th.tensor([8, 9]); dst = th.tensor([0, 0]) -g.add_edges(src, dst) - -# Edge broadcasting will do star graph in one go! -g = dgl.DGLGraph() -g.add_nodes(10) -src = th.tensor(list(range(1, 10))); -g.add_edges(src, 0) - -# Visualize the graph. 
-nx.draw(g.to_networkx(), with_labels=True) -plt.show() - -############################################################################### -# Assigning a feature -# ------------------- -# You can also assign features to nodes and edges of a :class:`DGLGraph`. The -# features are represented as dictionary of names (strings) and tensors, -# called **fields**. -# -# The following code snippet assigns each node a vector (len=3). -# -# .. note:: -# -# DGL aims to be framework-agnostic, and currently it supports PyTorch and -# MXNet tensors. The following examples use PyTorch only. - -import dgl -import torch as th - -x = th.randn(10, 3) -g.ndata['x'] = x - -############################################################################### -# :func:`ndata ` is a syntax sugar to access the feature -# data of all nodes. To get the features of some particular nodes, slice out -# the corresponding rows. - -g.ndata['x'][0] = th.zeros(1, 3) -g.ndata['x'][[0, 1, 2]] = th.zeros(3, 3) -g.ndata['x'][th.tensor([0, 1, 2])] = th.randn((3, 3)) - -############################################################################### -# Assigning edge features is similar to that of node features, -# except that you can also do it by specifying endpoints of the edges. - -g.edata['w'] = th.randn(9, 2) - -# Access edge set with IDs in integer, list, or integer tensor -g.edata['w'][1] = th.randn(1, 2) -g.edata['w'][[0, 1, 2]] = th.zeros(3, 2) -g.edata['w'][th.tensor([0, 1, 2])] = th.zeros(3, 2) - -# You can get the edge ids by giving endpoints, which are useful for accessing the features. -g.edata['w'][g.edge_id(1, 0)] = th.ones(1, 2) # edge 1 -> 0 -g.edata['w'][g.edge_ids([1, 2, 3], [0, 0, 0])] = th.ones(3, 2) # edges [1, 2, 3] -> 0 -# Use edge broadcasting whenever applicable. -g.edata['w'][g.edge_ids([1, 2, 3], [0, 0, 0])] = th.ones(3, 2) # edges [1, 2, 3] -> 0 - -############################################################################### -# After assignments, each node or edge field will be associated with a scheme -# containing the shape and data type (dtype) of its field value. - -print(g.node_attr_schemes()) -g.ndata['x'] = th.zeros((10, 4)) -print(g.node_attr_schemes()) - - -############################################################################### -# You can also remove node or edge states from the graph. This is particularly -# useful to save memory during inference. - -g.ndata.pop('x') -g.edata.pop('w') - - -############################################################################### -# Working with multigraphs -# ~~~~~~~~~~~~~~~~~~~~~~~~ -# Many graph applications need parallel edges, -# which class:DGLGraph supports by default. - -g_multi = dgl.DGLGraph() -g_multi.add_nodes(10) -g_multi.ndata['x'] = th.randn(10, 2) - -g_multi.add_edges(list(range(1, 10)), 0) -g_multi.add_edge(1, 0) # two edges on 1->0 - -g_multi.edata['w'] = th.randn(10, 2) -g_multi.edges[1].data['w'] = th.zeros(1, 2) -print(g_multi.edges()) - - -############################################################################### -# An edge in multigraph cannot be uniquely identified by using its incident nodes -# :math:`u` and :math:`v`; query their edge IDs use ``edge_id`` interface. - -_, _, eid_10 = g_multi.edge_id(1, 0, return_uv=True) -g_multi.edges[eid_10].data['w'] = th.ones(len(eid_10), 2) -print(g_multi.edata['w']) - - -############################################################################### -# .. note:: -# -# * Updating a feature of different schemes raises the risk of error on individual nodes (or -# node subset). 
-
-###############################################################################
-# Next steps
-# ----------
-# In the :doc:`next tutorial <3_pagerank>`, you learn the
-# DGL message passing interface by implementing PageRank.
diff --git a/tutorials/basics/3_pagerank.py.bak b/tutorials/basics/3_pagerank.py.bak
deleted file mode 100644
index f99678c53382..000000000000
--- a/tutorials/basics/3_pagerank.py.bak
+++ /dev/null
@@ -1,244 +0,0 @@
-"""
-.. currentmodule:: dgl
-
-Message Passing Tutorial
-========================
-
-**Author**: `Minjie Wang `_, Quan Gan, Yu Gai,
-Zheng Zhang
-
-In this tutorial, you learn how to use different levels of the message
-passing API with PageRank on a small graph. In DGL, the message and
-feature transformation functions are **user-defined functions** (UDFs).
-
-"""
-
-###############################################################################
-# The PageRank algorithm
-# ----------------------
-# In each iteration of PageRank, every node (web page) first scatters its
-# PageRank value uniformly to its downstream nodes. The new PageRank value of
-# each node is computed by aggregating the received PageRank values from its
-# neighbors, which is then adjusted by the damping factor:
-#
-# .. math::
-#
-#    PV(u) = \frac{1-d}{N} + d \times \sum_{v \in \mathcal{N}(u)}
-#    \frac{PV(v)}{D(v)}
-#
-# where :math:`N` is the number of nodes in the graph; :math:`D(v)` is the
-# out-degree of node :math:`v`; and :math:`\mathcal{N}(u)` is the set of
-# neighbors of :math:`u`.
-
-
-###############################################################################
-# A naive implementation
-# ----------------------
-# Create a graph with 100 nodes by using ``networkx`` and then convert it to a
-# :class:`DGLGraph`.
-
-import networkx as nx
-import matplotlib.pyplot as plt
-import torch
-import dgl
-
-N = 100  # number of nodes
-DAMP = 0.85  # damping factor
-K = 10  # number of iterations
-g = nx.erdos_renyi_graph(N, 0.1)
-g = dgl.DGLGraph(g)
-nx.draw(g.to_networkx(), node_size=50, node_color=[[.5, .5, .5,]])
-plt.show()
-
-
-###############################################################################
-# According to the algorithm, PageRank consists of two phases in a typical
-# scatter-gather pattern. Initialize the PageRank value of each node
-# to :math:`\frac{1}{N}` and then store each node's out-degree as a node feature.
-
-g.ndata['pv'] = torch.ones(N) / N
-g.ndata['deg'] = g.out_degrees(g.nodes()).float()
-
-
-###############################################################################
-# Define the message function, which divides every node's PageRank
-# value by its out-degree and passes the result as a message to its neighbors.
-
-def pagerank_message_func(edges):
-    return {'pv' : edges.src['pv'] / edges.src['deg']}
-
-
-###############################################################################
-# In DGL, the message functions are expressed as **Edge UDFs**. Edge UDFs
-# take in a single argument ``edges``. It has three members ``src``, ``dst``,
-# and ``data`` for accessing source node features, destination node features,
-# and edge features. Here, the function computes messages only
-# from source node features.
-#
-# Define the reduce function, which removes and aggregates the
-# messages from the node's ``mailbox`` and computes its new PageRank value.
-
-def pagerank_reduce_func(nodes):
-    msgs = torch.sum(nodes.mailbox['pv'], dim=1)
-    pv = (1 - DAMP) / N + DAMP * msgs
-    return {'pv' : pv}
-
-
-###############################################################################
-# The reduce functions are **Node UDFs**. Node UDFs have a single argument
-# ``nodes``, which has two members ``data`` and ``mailbox``. ``data``
-# contains the node features and ``mailbox`` contains all incoming message
-# features, stacked along the second dimension (hence the ``dim=1`` argument).
-#
-# The message UDF works on a batch of edges, whereas the reduce UDF works on
-# a batch of nodes: it consumes the messages those nodes received and outputs
-# their updated features. Their relationships are as follows:
-#
-# .. image:: https://i.imgur.com/kIMiuFb.png
-#
-
-###############################################################################
-# The algorithm is straightforward. Here is the code for one
-# PageRank iteration.
-
-def pagerank_naive(g):
-    # Phase #1: send out messages along all edges.
-    for u, v in zip(*g.edges()):
-        g.send((u, v), pagerank_message_func)
-    # Phase #2: receive messages to compute new PageRank values.
-    for v in g.nodes():
-        g.recv(v, pagerank_reduce_func)
-
-
-###############################################################################
-# Batching semantics for a large graph
-# ------------------------------------
-# The above code does not scale to a large graph because it iterates over all
-# the nodes. DGL solves this by allowing you to compute on a *batch* of nodes or
-# edges. For example, the following code triggers the message and reduce functions
-# on multiple nodes and edges at one time.
-
-def pagerank_batch(g):
-    g.send(g.edges(), pagerank_message_func)
-    g.recv(g.nodes(), pagerank_reduce_func)
-
-
-###############################################################################
-# You are still using the same reduce function ``pagerank_reduce_func``,
-# where ``nodes.mailbox['pv']`` is a *single* tensor, stacking the incoming
-# messages along the second dimension.
-#
-# You might wonder whether it is even possible to perform reduce on all
-# nodes in parallel, since each node may have a different number of incoming
-# messages and you cannot really "stack" tensors of different lengths together.
-# In general, DGL solves the problem by grouping the nodes by their number of
-# incoming messages, and calling the reduce function once for each group.
-
-
-###############################################################################
-# Use higher-level APIs for efficiency
-# ---------------------------------------
-# DGL provides many routines that combine basic ``send`` and ``recv`` in
-# various ways. These routines are called **level-2 APIs**. For example, the next code example
-# shows how to further simplify the PageRank example with such an API.
-
-def pagerank_level2(g):
-    g.update_all(pagerank_message_func, pagerank_reduce_func)
-
-
-###############################################################################
-# In addition to ``update_all``, you can use ``pull``, ``push``, and ``send_and_recv``
-# in this level-2 category. For more information, see :doc:`API reference <../../api/python/graph>`.
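-
-###############################################################################
-# As a quick sketch of one more level-2 routine (the node choice is arbitrary,
-# and it assumes the chosen nodes have in-edges), ``pull`` runs the same
-# message/reduce pair but only updates the given nodes:
-
-g.pull([0, 1], pagerank_message_func, pagerank_reduce_func)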
-
-###############################################################################
-# Use DGL ``builtin`` functions for efficiency
-# ------------------------------------------------
-# Some of the message and reduce functions are used frequently. For this reason, DGL also
-# provides ``builtin`` functions. For example, two ``builtin`` functions can be
-# used in the PageRank example.
-#
-# * :func:`dgl.function.copy_src(src, out) ` - This
-#   builtin is an edge UDF that computes the
-#   output using the source node feature data. To use this, specify the name of
-#   the source feature data (``src``) and the output name (``out``).
-#
-# * :func:`dgl.function.sum(msg, out) ` - This builtin is a node UDF
-#   that sums the messages in
-#   the node's mailbox. To use this, specify the message name (``msg``) and the
-#   output name (``out``).
-#
-# The following PageRank example uses such functions.
-
-import dgl.function as fn
-
-def pagerank_builtin(g):
-    g.ndata['pv'] = g.ndata['pv'] / g.ndata['deg']
-    g.update_all(message_func=fn.copy_src(src='pv', out='m'),
-                 reduce_func=fn.sum(msg='m', out='m_sum'))
-    g.ndata['pv'] = (1 - DAMP) / N + DAMP * g.ndata['m_sum']
-
-
-###############################################################################
-# In the previous example code, you provide the UDFs directly to
-# :func:`update_all ` as its arguments.
-# This will override any previously registered UDFs.
-#
-# In addition to cleaner code, using ``builtin`` functions also gives DGL the
-# opportunity to fuse operations together. This results in faster execution. For
-# example, DGL will fuse the ``copy_src`` message function and ``sum`` reduce
-# function into one sparse matrix-vector (spMV) multiplication.
-#
-# `The following section `_ describes why spMV can speed up the scatter-gather
-# phase in PageRank. For more details about the ``builtin`` functions in DGL,
-# see :doc:`API reference <../../api/python/function>`.
-#
-# You can also download and run the different code examples to see the differences.
-
-for k in range(K):
-    # Uncomment the corresponding line to select a different version.
-    # pagerank_naive(g)
-    # pagerank_batch(g)
-    # pagerank_level2(g)
-    pagerank_builtin(g)
-print(g.ndata['pv'])
-
-
-###############################################################################
-# .. _spmv:
-#
-# Using spMV for PageRank
-# -----------------------
-# Using ``builtin`` functions allows DGL to understand the semantics of UDFs.
-# This allows you to create an efficient implementation. For example, in the case
-# of PageRank, one common method to accelerate it is by using its linear algebra
-# form.
-#
-# .. math::
-#
-#    \mathbf{R}^{k} = \frac{1-d}{N} \mathbf{1} + d \mathbf{A} \mathbf{R}^{k-1}
-#
-# Here, :math:`\mathbf{R}^k` is the vector of the PageRank values of all nodes
-# at iteration :math:`k`; :math:`\mathbf{A}` is the sparse adjacency matrix
-# of the graph.
-# Computing this equation is quite efficient because there is an efficient
-# GPU kernel for the sparse matrix-vector multiplication (spMV). DGL
-# detects whether such an optimization is available through the ``builtin``
-# functions. If a certain combination of ``builtin`` functions can be mapped to an spMV
-# kernel (e.g., the PageRank example), DGL uses it automatically. We recommend
-# using ``builtin`` functions whenever possible.
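-
-###############################################################################
-# A small sanity check of the result (a sketch; it assumes the random graph
-# has no isolated nodes, so the total PageRank mass is preserved):
-
-print(float(g.ndata['pv'].sum()))  # should stay close to 1.0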
-
-###############################################################################
-# Next steps
-# ----------
-#
-# * Learn how to use DGL (:doc:`builtin functions<../../features/builtin>`) to write
-#   more efficient message passing.
-# * To see model tutorials, see the :doc:`overview page<../models/index>`.
-# * To learn about Graph Neural Networks, see :doc:`GCN tutorial<../models/1_gnn/1_gcn>`.
-# * To see how DGL batches multiple graphs, see :doc:`TreeLSTM tutorial<../models/2_small_graph/3_tree-lstm>`.
-# * Play with some graph generative models by following the tutorial for :doc:`Deep Generative Model of Graphs<../models/3_generative_model/5_dgmg>`.
-# * To learn how traditional models are interpreted from a graph perspective, see
-#   the tutorials on :doc:`CapsuleNet<../models/4_old_wines/2_capsule>` and
-#   :doc:`Transformer<../models/4_old_wines/7_transformer>`.
diff --git a/tutorials/basics/4_batch.py b/tutorials/basics/4_batch.py
deleted file mode 100644
index 9440abafe116..000000000000
--- a/tutorials/basics/4_batch.py
+++ /dev/null
@@ -1,229 +0,0 @@
-"""
-.. currentmodule:: dgl
-
-Graph Classification Tutorial
-=============================
-
-**Author**: `Mufei Li `_,
-`Minjie Wang `_,
-`Zheng Zhang `_.
-
-In this tutorial, you learn how to use DGL to batch multiple graphs of variable size and shape. The
-tutorial also demonstrates training a graph neural network for a simple graph classification task.
-
-Graph classification is an important problem
-with applications across many fields, such as bioinformatics, chemoinformatics, social
-network analysis, urban computing, and cybersecurity. Applying graph neural
-networks to this problem has been a popular approach recently, as can be seen in the
-following research references:
-`Ying et al., 2018 `_,
-`Cangea et al., 2018 `_,
-`Knyazev et al., 2018 `_,
-`Bianchi et al., 2019 `_,
-`Liao et al., 2019 `_,
-`Gao et al., 2019 `_.
-
-"""
-
-###############################################################################
-# Simple graph classification task
-# --------------------------------
-# In this tutorial, you learn how to perform batched graph classification
-# with DGL. The example task objective is to classify the eight types of topologies shown here.
-#
-# .. image:: https://data.dgl.ai/tutorial/batch/dataset_overview.png
-#     :align: center
-#
-# DGL implements a synthetic dataset :class:`data.MiniGCDataset`. The dataset has eight
-# different types of graphs and each class has the same number of graph samples.
-
-import dgl
-import torch
-from dgl.data import MiniGCDataset
-import matplotlib.pyplot as plt
-import networkx as nx
-# A dataset with 80 samples, each graph is
-# of size [10, 20]
-dataset = MiniGCDataset(80, 10, 20)
-graph, label = dataset[0]
-fig, ax = plt.subplots()
-nx.draw(graph.to_networkx(), ax=ax)
-ax.set_title('Class: {:d}'.format(label))
-plt.show()
-
-###############################################################################
-# Form a graph mini-batch
-# -----------------------
-# To train neural networks efficiently, a common practice is to batch
-# multiple samples together to form a mini-batch. Batching fixed-shaped tensor
-# inputs is common. For example, batching two images of size 28 x 28
-# gives a tensor of shape 2 x 28 x 28. By contrast, batching graph inputs
-# has two challenges:
-#
-# * Graphs are sparse.
-# * Graphs have variable sizes, in both their numbers of nodes and edges.
-#
-# To address this, DGL provides a :func:`dgl.batch` API. It leverages the idea that
-# a batch of graphs can be viewed as a large graph that consists of many disjoint
-# connected components. Below is a visualization that gives the general idea.
-#
-# .. image:: https://data.dgl.ai/tutorial/batch/batch.png
-#     :width: 400pt
-#     :align: center
-#
-# The return type of :func:`dgl.batch` is still a graph. In the same way,
-# a batch of tensors is still a tensor. This means that any code that works
-# for one graph immediately works for a batch of graphs. More importantly,
-# because DGL processes messages on all nodes and edges in parallel, this greatly
-# improves efficiency.
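-
-###############################################################################
-# A minimal sketch of this behavior, using two graphs from the dataset above
-# (the indices are chosen arbitrarily):
-
-g1, _ = dataset[0]
-g2, _ = dataset[1]
-bg = dgl.batch([g1, g2])
-# Nodes and edges of the member graphs are simply concatenated.
-print(bg.number_of_nodes() == g1.number_of_nodes() + g2.number_of_nodes())  # True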
-
-###############################################################################
-# Graph classifier
-# ----------------
-# Graph classification proceeds as follows.
-#
-# .. image:: https://data.dgl.ai/tutorial/batch/graph_classifier.png
-#
-# From a batch of graphs, perform message passing and graph convolution
-# for nodes to communicate with others. After message passing, compute a
-# tensor for the graph representation from node (and edge) attributes. This step is
-# often called readout or aggregation. Finally, the graph
-# representations are fed into a classifier :math:`g` to predict the graph labels.
-#
-# Graph convolution layers can be found in the ``dgl.nn`` submodule.
-
-from dgl.nn.pytorch import GraphConv
-
-###############################################################################
-# Readout and classification
-# --------------------------
-# For this demonstration, consider the initial node features to be their degrees.
-# After two rounds of graph convolution, perform a graph readout by averaging
-# over all node features for each graph in the batch.
-#
-# .. math::
-#
-#    h_g=\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}h_{v}
-#
-# In DGL, :func:`dgl.mean_nodes` handles this task for a batch of
-# graphs with variable size. You then feed the graph representations into a
-# classifier with one linear layer to obtain pre-softmax logits.
-
-import torch.nn as nn
-import torch.nn.functional as F
-
-class Classifier(nn.Module):
-    def __init__(self, in_dim, hidden_dim, n_classes):
-        super(Classifier, self).__init__()
-        self.conv1 = GraphConv(in_dim, hidden_dim)
-        self.conv2 = GraphConv(hidden_dim, hidden_dim)
-        self.classify = nn.Linear(hidden_dim, n_classes)
-
-    def forward(self, g):
-        # Use node degree as the initial node feature. For undirected graphs, the in-degree
-        # is the same as the out-degree.
-        h = g.in_degrees().view(-1, 1).float()
-        # Perform graph convolution and activation function.
-        h = F.relu(self.conv1(g, h))
-        h = F.relu(self.conv2(g, h))
-        g.ndata['h'] = h
-        # Calculate graph representation by averaging all the node representations.
-        hg = dgl.mean_nodes(g, 'h')
-        return self.classify(hg)
-
-###############################################################################
-# Setup and training
-# ------------------
-# Create a synthetic dataset of :math:`400` graphs with :math:`10` ~
-# :math:`20` nodes. :math:`320` graphs constitute a training set and
-# :math:`80` graphs constitute a test set.
-
-import torch.optim as optim
-from dgl.dataloading import GraphDataLoader
-
-# Create training and test sets.
-trainset = MiniGCDataset(320, 10, 20)
-testset = MiniGCDataset(80, 10, 20)
-# Use DGL's GraphDataLoader. By default, it handles the
-# graph batching operation for every mini-batch.
-data_loader = GraphDataLoader(trainset, batch_size=32, shuffle=True)
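-
-###############################################################################
-# Peeking at one mini-batch (a sketch; ``batch_size`` here is the batched
-# graph's property holding the number of member graphs):
-
-bg_sample, label_sample = next(iter(data_loader))
-print(bg_sample.batch_size)         # up to 32 graphs merged into one
-print(bg_sample.number_of_nodes())  # total nodes across the member graphs
-print(label_sample.shape)           # one label per member graph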
-
-# Create model
-model = Classifier(1, 256, trainset.num_classes)
-loss_func = nn.CrossEntropyLoss()
-optimizer = optim.Adam(model.parameters(), lr=0.001)
-model.train()
-
-epoch_losses = []
-for epoch in range(80):
-    epoch_loss = 0
-    for iter, (bg, label) in enumerate(data_loader):
-        prediction = model(bg)
-        loss = loss_func(prediction, label)
-        optimizer.zero_grad()
-        loss.backward()
-        optimizer.step()
-        epoch_loss += loss.detach().item()
-    epoch_loss /= (iter + 1)
-    print('Epoch {}, loss {:.4f}'.format(epoch, epoch_loss))
-    epoch_losses.append(epoch_loss)
-
-###############################################################################
-# The learning curve of a run is presented below.
-
-plt.title('cross entropy averaged over minibatches')
-plt.plot(epoch_losses)
-plt.show()
-
-###############################################################################
-# The trained model is evaluated on the test set created above. The training run
-# is kept short so that this tutorial executes quickly; with more epochs you are
-# likely to get a higher accuracy (:math:`80` % ~ :math:`90` %) than the ones
-# printed below.
-
-model.eval()
-# Convert a list of tuples to two lists
-test_X, test_Y = map(list, zip(*testset))
-test_bg = dgl.batch(test_X)
-test_Y = torch.tensor(test_Y).float().view(-1, 1)
-probs_Y = torch.softmax(model(test_bg), 1)
-sampled_Y = torch.multinomial(probs_Y, 1)
-argmax_Y = torch.max(probs_Y, 1)[1].view(-1, 1)
-print('Accuracy of sampled predictions on the test set: {:.4f}%'.format(
-    (test_Y == sampled_Y.float()).sum().item() / len(test_Y) * 100))
-print('Accuracy of argmax predictions on the test set: {:.4f}%'.format(
-    (test_Y == argmax_Y.float()).sum().item() / len(test_Y) * 100))
-
-###############################################################################
-# The animation here plots the probability that a trained model predicts the correct graph type.
-#
-# .. image:: https://data.dgl.ai/tutorial/batch/test_eval4.gif
-#
-# To understand the node and graph representations that a trained model learned,
-# we use `t-SNE `_ for dimensionality reduction
-# and visualization.
-#
-# .. image:: https://data.dgl.ai/tutorial/batch/tsne_node2.png
-#     :align: center
-#
-# .. image:: https://data.dgl.ai/tutorial/batch/tsne_graph2.png
-#     :align: center
-#
-# The two small figures on the top separately visualize node representations after one and two
-# layers of graph convolution. The figure on the bottom visualizes
-# the pre-softmax logits for graphs as graph representations.
-#
-# While the visualization does suggest some clustering effect in the node
-# features, you would not expect a perfect result: node degrees are
-# deterministic, so these initial features carry limited information. The
-# graph features, by contrast, are much better separated.
-#
-# What's next?
-# ------------
-# Graph classification with graph neural networks is still a new field.
-# It's waiting for people to bring more exciting discoveries. The work requires
-# mapping different graphs to different embeddings, while preserving
-# their structural similarity in the embedding space. To learn more about it, see
-# `How Powerful Are Graph Neural Networks? `_, a research paper
-# published for the International Conference on Learning Representations 2019.
-#
-# For more examples about batched graph processing, see the following:
-#
-# * Tutorials for `Tree LSTM `_ and `Deep Generative Models of Graphs `_
-# * An example implementation of `Junction Tree VAE `_
diff --git a/tutorials/basics/5_hetero.py b/tutorials/basics/5_hetero.py
deleted file mode 100644
index 6c1dec0be2a8..000000000000
--- a/tutorials/basics/5_hetero.py
+++ /dev/null
@@ -1,406 +0,0 @@
-"""
-.. currentmodule:: dgl
-
-Working with Heterogeneous Graphs
-=================================
-
-**Author**: Quan Gan, `Minjie Wang `_, Mufei Li,
-George Karypis, Zheng Zhang
-
-In this tutorial, you learn about:
-
-* Examples of heterogeneous graph data and typical applications.
-
-* Creating and manipulating a heterogeneous graph in DGL.
-
-* Implementing `Relational-GCN `_, a popular GNN model,
-  for heterogeneous graph input.
-
-* Training a model to solve a node classification task.
-
-Heterogeneous graphs, or *heterographs* for short, are graphs that contain
-different types of nodes and edges. The different types of nodes and edges tend
-to have different types of attributes that are designed to capture the
-characteristics of each node and edge type. Within the context of
-graph neural networks, depending on their complexity, certain node and edge types
-might need to be modeled with representations that have a different number of dimensions.
-
-DGL supports graph neural network computations on such heterogeneous graphs, by
-using the heterograph class and its associated API.
-
-"""
-
-###############################################################################
-# Examples of heterographs
-# ------------------------
-# Many graph datasets represent relationships among various types of entities.
-# This section provides an overview of several graph use-cases that show such relationships
-# and can have their data represented as heterographs.
-#
-# Citation graph
-# ~~~~~~~~~~~~~~~
-# The Association for Computing Machinery publishes an `ACM dataset `_ that contains two
-# million papers, their authors, publication venues, and the other papers
-# that were cited. This information can be represented as a heterogeneous graph.
-#
-# The following diagram shows several entities in the ACM dataset and the relationships among them
-# (taken from `Shi et al., 2015 `_).
-#
-# .. figure:: https://data.dgl.ai/tutorial/hetero/acm-example.png
-#
-# This graph has three types of entities that correspond to papers, authors, and publication venues.
-# It also contains three types of edges that connect the following:
-#
-# * Authors with papers corresponding to *written-by* relationships
-#
-# * Papers with publication venues corresponding to *published-in* relationships
-#
-# * Papers with other papers corresponding to *cited-by* relationships
-#
-#
-# Recommender systems
-# ~~~~~~~~~~~~~~~~~~~~
-# The datasets used in recommender systems often contain
-# interactions between users and items. For example, the data could include the
-# ratings that users have provided to movies. Such interactions can be modeled
-# as heterographs.
-#
-# The nodes in these heterographs will have two types, *users* and *movies*. The edges
-# will correspond to the user-movie interactions. Furthermore, if an interaction is
-# marked with a rating, then each rating value could correspond to a different edge type.
-# The following diagram shows an example of user-item interactions as a heterograph.
-#
-# .. figure:: https://data.dgl.ai/tutorial/hetero/recsys-example.png
-#
-
-# Knowledge graph
-# ~~~~~~~~~~~~~~~~
-# Knowledge graphs are inherently heterogeneous. For example, in
-# Wikidata, Barack Obama (item Q76) is an instance of a human, which could be viewed as
-# the entity class, whose spouse (item P26) is Michelle Obama (item Q13133) and
-# occupation (item P106) is politician (item Q82955). The relationships are shown in the
-# following diagram.
-#
-# .. figure:: https://data.dgl.ai/tutorial/hetero/kg-example.png
-#
-
-###############################################################################
-# Creating a heterograph in DGL
-# -----------------------------
-# You can create a heterograph in DGL using the :func:`dgl.heterograph` API.
-# The argument to :func:`dgl.heterograph` is a dictionary. The keys are tuples
-# in the form of ``(srctype, edgetype, dsttype)`` specifying the relation name
-# and the two entity types it connects. Such tuples are called *canonical edge types*.
-# The values are the data used to initialize the graph structure, that is, which
-# nodes the edges actually connect.
-#
-# For instance, the following code creates the user-item interactions heterograph shown earlier.
-
-# Each value of the dictionary is a pair of source and destination arrays.
-# Nodes are integer IDs starting from zero. Node IDs of different types are
-# counted separately.
-import dgl
-import numpy as np
-
-ratings = dgl.heterograph(
-    {('user', '+1', 'movie') : (np.array([0, 0, 1]), np.array([0, 1, 0])),
-     ('user', '-1', 'movie') : (np.array([2]), np.array([1]))})
-
-###############################################################################
-# Manipulating heterograph
-# ------------------------
-# You can create a more realistic heterograph by using the ACM dataset. To do this, first
-# download the dataset as follows:
-
-import scipy.io
-import urllib.request
-
-data_url = 'https://data.dgl.ai/dataset/ACM.mat'
-data_file_path = '/tmp/ACM.mat'
-
-urllib.request.urlretrieve(data_url, data_file_path)
-data = scipy.io.loadmat(data_file_path)
-print(list(data.keys()))
-
-###############################################################################
-# The dataset stores node information by their types: ``P`` for paper, ``A``
-# for author, ``C`` for conference, ``L`` for subject code, and so on. The relationships
-# are stored as SciPy sparse matrices under keys of the form ``XvsY``, where ``X`` and ``Y``
-# could be any of the node type codes.
-#
-# The following code prints out some statistics about the paper-author relationships.
-
-print(type(data['PvsA']))
-print('#Papers:', data['PvsA'].shape[0])
-print('#Authors:', data['PvsA'].shape[1])
-print('#Links:', data['PvsA'].nnz)
-
-###############################################################################
-# Converting this SciPy matrix to a heterograph in DGL is straightforward.
-
-pa_g = dgl.heterograph({('paper', 'written-by', 'author') : data['PvsA'].nonzero()})
-
-###############################################################################
-# You can easily print out the type names and other structural information.
-
-print('Node types:', pa_g.ntypes)
-print('Edge types:', pa_g.etypes)
-print('Canonical edge types:', pa_g.canonical_etypes)
-
-# Nodes and edges are assigned integer IDs starting from zero and each type has its own counting.
-# To distinguish the nodes and edges of different types, specify the type name as the argument.
-print(pa_g.number_of_nodes('paper')) -# Canonical edge type name can be shortened to only one edge type name if it is -# uniquely distinguishable. -print(pa_g.number_of_edges(('paper', 'written-by', 'author'))) -print(pa_g.number_of_edges('written-by')) -print(pa_g.successors(1, etype='written-by')) # get the authors that write paper #1 - -# Type name argument could be omitted whenever the behavior is unambiguous. -print(pa_g.number_of_edges()) # Only one edge type, the edge type argument could be omitted - -############################################################################### -# A homogeneous graph is just a special case of a heterograph with only one type -# of node and edge. - -# Paper-citing-paper graph is a homogeneous graph -pp_g = dgl.heterograph({('paper', 'citing', 'paper') : data['PvsP'].nonzero()}) -# equivalent (shorter) API for creating homogeneous graph -pp_g = dgl.from_scipy(data['PvsP']) - -# All the ntype and etype arguments could be omitted because the behavior is unambiguous. -print(pp_g.number_of_nodes()) -print(pp_g.number_of_edges()) -print(pp_g.successors(3)) - -############################################################################### -# Create a subset of the ACM graph using the paper-author, paper-paper, -# and paper-subject relationships. Meanwhile, also add the reverse -# relationship to prepare for the later sections. - -G = dgl.heterograph({ - ('paper', 'written-by', 'author') : data['PvsA'].nonzero(), - ('author', 'writing', 'paper') : data['PvsA'].transpose().nonzero(), - ('paper', 'citing', 'paper') : data['PvsP'].nonzero(), - ('paper', 'cited', 'paper') : data['PvsP'].transpose().nonzero(), - ('paper', 'is-about', 'subject') : data['PvsL'].nonzero(), - ('subject', 'has', 'paper') : data['PvsL'].transpose().nonzero(), - }) - -print(G) - -############################################################################### -# **Metagraph** (or network schema) is a useful summary of a heterograph. -# Serving as a template for a heterograph, it tells how many types of objects -# exist in the network and where the possible links exist. -# -# DGL provides easy access to the metagraph, which could be visualized using -# external tools. - -# Draw the metagraph using graphviz. -import pygraphviz as pgv -def plot_graph(nxg): - ag = pgv.AGraph(strict=False, directed=True) - for u, v, k in nxg.edges(keys=True): - ag.add_edge(u, v, label=k) - ag.layout('dot') - ag.draw('graph.png') - -plot_graph(G.metagraph()) - -############################################################################### -# Learning tasks associated with heterographs -# ------------------------------------------- -# Some of the typical learning tasks that involve heterographs include: -# -# * *Node classification and regression* to predict the class of each node or -# estimate a value associated with it. -# -# * *Link prediction* to predict if there is an edge of a certain -# type between a pair of nodes, or predict which other nodes a particular -# node is connected with (and optionally the edge types of such connections). -# -# * *Graph classification/regression* to assign an entire -# heterograph into one of the target classes or to estimate a numerical -# value associated with it. -# -# In this tutorial, we designed a simple example for the first task. -# -# A semi-supervised node classification example -# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -# Our goal is to predict the publishing conference of a paper using the ACM -# academic graph we just created. 
To further simplify the task, we only focus -# on papers published in three conferences: *KDD*, *ICML*, and *VLDB*. All -# the other papers are not labeled, making it a semi-supervised setting. -# -# The following code extracts those papers from the raw dataset and prepares -# the training, validation, testing split. - -import numpy as np -import torch -import torch.nn as nn -import torch.nn.functional as F - -pvc = data['PvsC'].tocsr() -# find all papers published in KDD, ICML, VLDB -c_selected = [0, 11, 13] # KDD, ICML, VLDB -p_selected = pvc[:, c_selected].tocoo() -# generate labels -labels = pvc.indices -labels[labels == 11] = 1 -labels[labels == 13] = 2 -labels = torch.tensor(labels).long() - -# generate train/val/test split -pid = p_selected.row -shuffle = np.random.permutation(pid) -train_idx = torch.tensor(shuffle[0:800]).long() -val_idx = torch.tensor(shuffle[800:900]).long() -test_idx = torch.tensor(shuffle[900:]).long() - -############################################################################### -# Relational-GCN on heterograph -# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -# We use `Relational-GCN `_ to learn the -# representation of nodes in the graph. Its message-passing equation is as -# follows: -# -# .. math:: -# -# h_i^{(l+1)} = \sigma\left(\sum_{r\in \mathcal{R}} -# \sum_{j\in\mathcal{N}_r(i)}W_r^{(l)}h_j^{(l)}\right) -# -# Breaking down the equation, you see that there are two parts in the -# computation. -# -# (i) Message computation and aggregation within each relation :math:`r` -# -# (ii) Reduction that merges the results from multiple relationships -# -# Following this intuition, perform message passing on a heterograph in -# two steps. -# -# (i) Per-edge-type message passing -# -# (ii) Type wise reduction - -import dgl.function as fn - -class HeteroRGCNLayer(nn.Module): - def __init__(self, in_size, out_size, etypes): - super(HeteroRGCNLayer, self).__init__() - # W_r for each relation - self.weight = nn.ModuleDict({ - name : nn.Linear(in_size, out_size) for name in etypes - }) - - def forward(self, G, feat_dict): - # The input is a dictionary of node features for each type - funcs = {} - for srctype, etype, dsttype in G.canonical_etypes: - # Compute W_r * h - Wh = self.weight[etype](feat_dict[srctype]) - # Save it in graph for message passing - G.nodes[srctype].data['Wh_%s' % etype] = Wh - # Specify per-relation message passing functions: (message_func, reduce_func). - # Note that the results are saved to the same destination feature 'h', which - # hints the type wise reducer for aggregation. - funcs[etype] = (fn.copy_u('Wh_%s' % etype, 'm'), fn.mean('m', 'h')) - # Trigger message passing of multiple types. - # The first argument is the message passing functions for each relation. - # The second one is the type wise reducer, could be "sum", "max", - # "min", "mean", "stack" - G.multi_update_all(funcs, 'sum') - # return the updated node feature dictionary - return {ntype : G.nodes[ntype].data['h'] for ntype in G.ntypes} - -############################################################################### -# Create a simple GNN by stacking two ``HeteroRGCNLayer``. Since the -# nodes do not have input features, make their embeddings trainable. - -class HeteroRGCN(nn.Module): - def __init__(self, G, in_size, hidden_size, out_size): - super(HeteroRGCN, self).__init__() - # Use trainable node embeddings as featureless inputs. 
-
-        embed_dict = {ntype : nn.Parameter(torch.Tensor(G.number_of_nodes(ntype), in_size))
-                      for ntype in G.ntypes}
-        for key, embed in embed_dict.items():
-            nn.init.xavier_uniform_(embed)
-        self.embed = nn.ParameterDict(embed_dict)
-        # create layers
-        self.layer1 = HeteroRGCNLayer(in_size, hidden_size, G.etypes)
-        self.layer2 = HeteroRGCNLayer(hidden_size, out_size, G.etypes)
-
-    def forward(self, G):
-        h_dict = self.layer1(G, self.embed)
-        h_dict = {k : F.leaky_relu(h) for k, h in h_dict.items()}
-        h_dict = self.layer2(G, h_dict)
-        # get paper logits
-        return h_dict['paper']
-
-###############################################################################
-# Train and evaluate
-# ~~~~~~~~~~~~~~~~~~
-# Train and evaluate this network.
-
-# Create the model. The output has three logits for three classes.
-model = HeteroRGCN(G, 10, 10, 3)
-
-opt = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
-
-# Initialize as tensors so that ``.item()`` in the logging below always works.
-best_val_acc = torch.tensor(0.)
-best_test_acc = torch.tensor(0.)
-
-for epoch in range(100):
-    logits = model(G)
-    # The loss is computed only for labeled nodes.
-    loss = F.cross_entropy(logits[train_idx], labels[train_idx])
-
-    pred = logits.argmax(1)
-    train_acc = (pred[train_idx] == labels[train_idx]).float().mean()
-    val_acc = (pred[val_idx] == labels[val_idx]).float().mean()
-    test_acc = (pred[test_idx] == labels[test_idx]).float().mean()
-
-    if best_val_acc < val_acc:
-        best_val_acc = val_acc
-        best_test_acc = test_acc
-
-    opt.zero_grad()
-    loss.backward()
-    opt.step()
-
-    if epoch % 5 == 0:
-        print('Loss %.4f, Train Acc %.4f, Val Acc %.4f (Best %.4f), Test Acc %.4f (Best %.4f)' % (
-            loss.item(),
-            train_acc.item(),
-            val_acc.item(),
-            best_val_acc.item(),
-            test_acc.item(),
-            best_test_acc.item(),
-        ))
-
-###############################################################################
-# What's next?
-# ------------
-# * Check out our full implementation in PyTorch
-#   `here `_.
-#
-# * We also provide the following model examples:
-#
-#   * `Graph Convolutional Matrix Completion `_,
-#     which we implement in MXNet
-#     `here `_.
-#
-#   * `Heterogeneous Graph Attention Network `_
-#     requires transforming a heterograph into a homogeneous graph according to
-#     a given metapath (i.e. a path template consisting of edge types). We
-#     provide :func:`dgl.transform.metapath_reachable_graph` to do this (see the
-#     short sketch at the end of this tutorial). See the full
-#     implementation
-#     `here `_.
-#
-#   * `Metapath2vec `_ requires
-#     generating random walk paths according to a given metapath. Please
-#     refer to the full metapath2vec implementation
-#     `here `_.
-#
-# * :doc:`Full heterograph API reference <../../api/python/heterograph>`.
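-
-###############################################################################
-# As a short sketch of the metapath transform mentioned above (a hedged
-# example on the ACM graph ``G`` from this tutorial): following 'written-by'
-# and then 'writing' connects papers that share an author.
-
-co_author_papers = dgl.metapath_reachable_graph(G, ['written-by', 'writing'])
-print(co_author_papers.number_of_nodes())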
diff --git a/tutorials/basics/README.txt b/tutorials/basics/README.txt deleted file mode 100644 index e69de29bb2d1..000000000000 diff --git a/tutorials/blitz/.gitignore b/tutorials/blitz/.gitignore new file mode 100644 index 000000000000..6e0150d44c55 --- /dev/null +++ b/tutorials/blitz/.gitignore @@ -0,0 +1,2 @@ +*.dgl +*.csv diff --git a/new-tutorial/blitz/1_introduction.py b/tutorials/blitz/1_introduction.py similarity index 100% rename from new-tutorial/blitz/1_introduction.py rename to tutorials/blitz/1_introduction.py diff --git a/new-tutorial/blitz/2_dglgraph.py b/tutorials/blitz/2_dglgraph.py similarity index 100% rename from new-tutorial/blitz/2_dglgraph.py rename to tutorials/blitz/2_dglgraph.py diff --git a/new-tutorial/blitz/3_message_passing.py b/tutorials/blitz/3_message_passing.py similarity index 100% rename from new-tutorial/blitz/3_message_passing.py rename to tutorials/blitz/3_message_passing.py diff --git a/new-tutorial/blitz/4_link_predict.py b/tutorials/blitz/4_link_predict.py similarity index 100% rename from new-tutorial/blitz/4_link_predict.py rename to tutorials/blitz/4_link_predict.py diff --git a/new-tutorial/blitz/5_graph_classification.py b/tutorials/blitz/5_graph_classification.py similarity index 100% rename from new-tutorial/blitz/5_graph_classification.py rename to tutorials/blitz/5_graph_classification.py diff --git a/new-tutorial/blitz/6_load_data.py b/tutorials/blitz/6_load_data.py similarity index 100% rename from new-tutorial/blitz/6_load_data.py rename to tutorials/blitz/6_load_data.py diff --git a/tutorials/blitz/README.txt b/tutorials/blitz/README.txt new file mode 100644 index 000000000000..e38f98075ebe --- /dev/null +++ b/tutorials/blitz/README.txt @@ -0,0 +1,2 @@ +A Blitz Introduction to DGL +=========================== diff --git a/tutorials/large/.gitignore b/tutorials/large/.gitignore new file mode 100644 index 000000000000..685fa130829e --- /dev/null +++ b/tutorials/large/.gitignore @@ -0,0 +1,2 @@ +dataset +model.pt diff --git a/new-tutorial/large/L0_neighbor_sampling_overview.py b/tutorials/large/L0_neighbor_sampling_overview.py similarity index 100% rename from new-tutorial/large/L0_neighbor_sampling_overview.py rename to tutorials/large/L0_neighbor_sampling_overview.py diff --git a/new-tutorial/large/L1_large_node_classification.py b/tutorials/large/L1_large_node_classification.py similarity index 100% rename from new-tutorial/large/L1_large_node_classification.py rename to tutorials/large/L1_large_node_classification.py diff --git a/new-tutorial/large/L2_large_link_prediction.py b/tutorials/large/L2_large_link_prediction.py similarity index 100% rename from new-tutorial/large/L2_large_link_prediction.py rename to tutorials/large/L2_large_link_prediction.py diff --git a/new-tutorial/large/L4_message_passing.py b/tutorials/large/L4_message_passing.py similarity index 99% rename from new-tutorial/large/L4_message_passing.py rename to tutorials/large/L4_message_passing.py index 754266fec3a0..513d88d6f236 100644 --- a/new-tutorial/large/L4_message_passing.py +++ b/tutorials/large/L4_message_passing.py @@ -224,7 +224,7 @@ def forward(self, bipartites, x): # # Here is a step-by-step tutorial for writing a GNN module for both # :doc:`full-graph training <../blitz/1_introduction>` *and* :doc:`stochastic -# training `. +# training `. 
# # Say you start with a GNN module that works for full-graph training only: # diff --git a/tutorials/large/README.txt b/tutorials/large/README.txt new file mode 100644 index 000000000000..574b47192f95 --- /dev/null +++ b/tutorials/large/README.txt @@ -0,0 +1,2 @@ +Stochastic Training of GNNs +=========================== diff --git a/tutorials/models/1_gnn/1_gcn.py b/tutorials/models/1_gnn/1_gcn.py index 61b4673854e9..a7b9936b3b2a 100644 --- a/tutorials/models/1_gnn/1_gcn.py +++ b/tutorials/models/1_gnn/1_gcn.py @@ -7,16 +7,19 @@ **Author:** `Qi Huang `_, `Minjie Wang `_, Yu Gai, Quan Gan, Zheng Zhang +.. warning:: + + The tutorial aims at gaining insights into the paper, with code as a mean + of explanation. The implementation thus is NOT optimized for running + efficiency. For recommended implementation, please refer to the `official + examples `_. + This is a gentle introduction of using DGL to implement Graph Convolutional Networks (Kipf & Welling et al., `Semi-Supervised Classification with Graph Convolutional Networks `_). We explain -what is under the hood of the :class:`~dgl.nn.pytorch.GraphConv` module. +what is under the hood of the :class:`~dgl.nn.GraphConv` module. The reader is expected to learn how to define a new GNN layer using DGL's message passing APIs. - -We build upon the :doc:`earlier tutorial <../../basics/3_pagerank>` on DGLGraph -and demonstrate how DGL combines graph with deep neural network and learn -structural representations. """ ############################################################################### @@ -179,8 +182,7 @@ def evaluate(model, g, features, labels, mask): # The equation can be efficiently implemented using sparse matrix # multiplication kernels (such as Kipf's # `pygcn `_ code). The above DGL implementation -# in fact has already used this trick due to the use of builtin functions. To -# understand what is under the hood, please read our tutorial on :doc:`PageRank <../../basics/3_pagerank>`. +# in fact has already used this trick due to the use of builtin functions. # # Note that the tutorial code implements a simplified version of GCN where we # replace :math:`\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}` with diff --git a/tutorials/models/1_gnn/4_rgcn.py b/tutorials/models/1_gnn/4_rgcn.py index 01043babe3ad..1f4d8e873632 100644 --- a/tutorials/models/1_gnn/4_rgcn.py +++ b/tutorials/models/1_gnn/4_rgcn.py @@ -1,11 +1,18 @@ """ .. _model-rgcn: -Relational graph convolutional network +Relational Graph Convolutional Network ================================================ **Author:** Lingfan Yu, Mufei Li, Zheng Zhang +.. warning:: + + The tutorial aims at gaining insights into the paper, with code as a mean + of explanation. The implementation thus is NOT optimized for running + efficiency. For recommended implementation, please refer to the `official + examples `_. + In this tutorial, you learn how to implement a relational graph convolutional network (R-GCN). This type of network is one effort to generalize GCN to handle different relationships between entities in a knowledge base. To diff --git a/tutorials/models/1_gnn/6_line_graph.py b/tutorials/models/1_gnn/6_line_graph.py index cf23765bde85..9926f566b534 100644 --- a/tutorials/models/1_gnn/6_line_graph.py +++ b/tutorials/models/1_gnn/6_line_graph.py @@ -1,11 +1,19 @@ """ .. _model-line-graph: -Line graph neural network +Line Graph Neural Network ========================= **Author**: `Qi Huang `_, Yu Gai, `Minjie Wang `_, Zheng Zhang + +.. 
warning:: + + The tutorial aims at gaining insights into the paper, with code as a mean + of explanation. The implementation thus is NOT optimized for running + efficiency. For recommended implementation, please refer to the `official + examples `_. + """ ########################################################################################### diff --git a/tutorials/models/1_gnn/9_gat.py b/tutorials/models/1_gnn/9_gat.py index 3f1aa541d1da..444942b47c12 100644 --- a/tutorials/models/1_gnn/9_gat.py +++ b/tutorials/models/1_gnn/9_gat.py @@ -1,14 +1,21 @@ """ .. _model-gat: -Graph attention network -================================== +Understand Graph Attention Network +======================================= **Authors:** `Hao Zhang `_, `Mufei Li `_, `Minjie Wang `_ `Zheng Zhang `_ +.. warning:: + + The tutorial aims at gaining insights into the paper, with code as a mean + of explanation. The implementation thus is NOT optimized for running + efficiency. For recommended implementation, please refer to the `official + examples `_. + In this tutorial, you learn about a graph attention network (GAT) and how it can be implemented in PyTorch. You can also learn to visualize and understand what the attention mechanism has learned. diff --git a/tutorials/models/1_gnn/README.txt b/tutorials/models/1_gnn/README.txt index 1e80772ba867..2259e6188e1b 100644 --- a/tutorials/models/1_gnn/README.txt +++ b/tutorials/models/1_gnn/README.txt @@ -1,14 +1,13 @@ .. _tutorials1-index: Graph neural networks and its variants -==================================== +-------------------------------------------- * **Graph convolutional network (GCN)** `[research paper] `__ `[tutorial] <1_gnn/1_gcn.html>`__ `[Pytorch code] `__ `[MXNet code] `__: - This is the most basic GCN. The tutorial covers the basic uses of DGL APIs. * **Graph attention network (GAT)** `[research paper] `__ `[tutorial] <1_gnn/9_gat.html>`__ `[Pytorch code] diff --git a/tutorials/models/2_small_graph/3_tree-lstm.py b/tutorials/models/2_small_graph/3_tree-lstm.py index 32bec97b478f..32854669508e 100644 --- a/tutorials/models/2_small_graph/3_tree-lstm.py +++ b/tutorials/models/2_small_graph/3_tree-lstm.py @@ -1,12 +1,20 @@ """ .. _model-tree-lstm: -Tutorial: Tree-LSTM in DGL +Tree-LSTM in DGL ========================== **Author**: Zihao Ye, Qipeng Guo, `Minjie Wang `_, `Jake Zhao `_, Zheng Zhang + +.. warning:: + + The tutorial aims at gaining insights into the paper, with code as a mean + of explanation. The implementation thus is NOT optimized for running + efficiency. For recommended implementation, please refer to the `official + examples `_. + """ ############################################################################## diff --git a/tutorials/models/2_small_graph/README.txt b/tutorials/models/2_small_graph/README.txt index 0981ca83289e..998d63e1e6fc 100644 --- a/tutorials/models/2_small_graph/README.txt +++ b/tutorials/models/2_small_graph/README.txt @@ -1,7 +1,7 @@ .. _tutorials2-index: Batching many small graphs -============================== +------------------------------- * **Tree-LSTM** `[paper] `__ `[tutorial] <2_small_graph/3_tree-lstm.html>`__ `[PyTorch code] diff --git a/tutorials/models/3_generative_model/5_dgmg.py b/tutorials/models/3_generative_model/5_dgmg.py index ad8cb0cce8a0..c9fcd9bf4f69 100644 --- a/tutorials/models/3_generative_model/5_dgmg.py +++ b/tutorials/models/3_generative_model/5_dgmg.py @@ -1,11 +1,19 @@ """ .. 
_model-dgmg: -Tutorial: Generative models of graphs +Generative Models of Graphs =========================================== **Author**: `Mufei Li `_, `Lingfan Yu `_, Zheng Zhang + +.. warning:: + + The tutorial aims at gaining insights into the paper, with code as a mean + of explanation. The implementation thus is NOT optimized for running + efficiency. For recommended implementation, please refer to the `official + examples `_. + """ ############################################################################## diff --git a/tutorials/models/3_generative_model/README.txt b/tutorials/models/3_generative_model/README.txt index ccfdf9279076..3fa834e86b07 100644 --- a/tutorials/models/3_generative_model/README.txt +++ b/tutorials/models/3_generative_model/README.txt @@ -1,7 +1,7 @@ .. _tutorials3-index: Generative models -================== +-------------------- * **DGMG** `[paper] `__ `[tutorial] <3_generative_model/5_dgmg.html>`__ `[PyTorch code] @@ -12,10 +12,3 @@ Generative models sample has a dynamic, probability-driven structure that is not available before training. You can progressively leverage intra- and inter-graph parallelism to steadily improve the performance. - -* **JTNN** `[paper] `__ `[PyTorch code] - `__: - This network generates molecular graphs using the framework of - a variational auto-encoder. The junction tree neural network (JTNN) builds - structure hierarchically. In the case of molecular graphs, it uses a junction tree as - the middle scaffolding. diff --git a/tutorials/models/4_old_wines/2_capsule.py b/tutorials/models/4_old_wines/2_capsule.py index 48eee059668f..985884862839 100644 --- a/tutorials/models/4_old_wines/2_capsule.py +++ b/tutorials/models/4_old_wines/2_capsule.py @@ -1,7 +1,7 @@ """ .. _model-capsule: -Capsule network tutorial +Capsule Network =========================== **Author**: Jinjing Zhou, `Jake Zhao `_, Zheng Zhang, Jinyang Li @@ -9,6 +9,14 @@ In this tutorial, you learn how to describe one of the more classical models in terms of graphs. The approach offers a different perspective. The tutorial describes how to implement a Capsule model for the `capsule network `__. + +.. warning:: + + The tutorial aims at gaining insights into the paper, with code as a mean + of explanation. The implementation thus is NOT optimized for running + efficiency. For recommended implementation, please refer to the `official + examples `_. + """ ####################################################################################### # Key ideas of Capsule diff --git a/tutorials/models/4_old_wines/7_transformer.py b/tutorials/models/4_old_wines/7_transformer.py index 20161bd7726f..520849ea301d 100644 --- a/tutorials/models/4_old_wines/7_transformer.py +++ b/tutorials/models/4_old_wines/7_transformer.py @@ -1,10 +1,18 @@ """ .. _model-transformer: -Transformer tutorial -==================== +Transformer as a Graph Neural Network +====================================== **Author**: Zihao Ye, Jinjing Zhou, Qipeng Guo, Quan Gan, Zheng Zhang + +.. warning:: + + The tutorial aims at gaining insights into the paper, with code as a mean + of explanation. The implementation thus is NOT optimized for running + efficiency. For recommended implementation, please refer to the `official + examples `_. + """ ################################################################################################ # In this tutorial, you learn about a simplified implementation of the Transformer model. 
diff --git a/tutorials/models/4_old_wines/README.txt b/tutorials/models/4_old_wines/README.txt index e6bee9f0567a..d3b29987c459 100644 --- a/tutorials/models/4_old_wines/README.txt +++ b/tutorials/models/4_old_wines/README.txt @@ -2,7 +2,7 @@ Revisit classic models from a graph perspective -==================================== +------------------------------------------------------- * **Capsule** `[paper] `__ `[tutorial] <4_old_wines/2_capsule.html>`__ `[PyTorch code] diff --git a/tutorials/models/README.txt b/tutorials/models/README.txt index e69de29bb2d1..73778720649b 100644 --- a/tutorials/models/README.txt +++ b/tutorials/models/README.txt @@ -0,0 +1,2 @@ +Paper Study with DGL +=========================================