Master refactor split chapter2n3 (dmlc#2215)
* [Feature] Add full graph training with dgl built-in dataset.

* [Bug] fix model to cuda.

* [Feature] Add test loss and accuracy

* [Fix] Add random

* [Bug] Fix batch norm error

* [Doc] Test with CN in Sphinx

* [Doc] Remove the test CN docs.

* [Feature] Add input embedding layer

* [Doc] fill readme with new performance results

* [Doc] Add Chinese User Guide, graph and 1.5

* [Doc] Refactor and split chapter 4

* [Fix] Remove CompGCN example codes

* [Doc] Add chapter 2 refactor and split

* [Fix] code format of savenload

* [Doc] Split chapter 3

* [Doc] Add introduction phrase of chapter 2

* [Doc] Add introduction phrase of chapter 3

* Fix

* Update chapter 2

* Update chapter 3

* Update chapter 4

Co-authored-by: mufeili <[email protected]>
zhjwy9343 and mufeili authored Sep 20, 2020
1 parent 49e9697 commit 90d86fc
Showing 16 changed files with 1,294 additions and 1,202 deletions.
100 changes: 100 additions & 0 deletions docs/source/guide/data-dataset.rst
@@ -0,0 +1,100 @@
.. _guide-data-pipeline-dataset:

4.1 DGLDataset class
--------------------

:class:`~dgl.data.DGLDataset` is the base class for processing, loading and saving
graph datasets defined in :ref:`apidata`. It implements the basic pipeline
for processing graph data. The following flow chart shows how the
pipeline works.

.. figure:: https://data.dgl.ai/asset/image/userguide_data_flow.png
    :align: center

    Flow chart for graph data input pipeline defined in class DGLDataset.

To process a graph dataset located on a remote server or local disk, one can
define a class, say ``MyDataset``, inheriting from :class:`dgl.data.DGLDataset`. The
template of ``MyDataset`` is as follows.

.. code::

    from dgl.data import DGLDataset

    class MyDataset(DGLDataset):
        """ Template for customizing graph datasets in DGL.

        Parameters
        ----------
        url : str
            URL to download the raw dataset
        raw_dir : str
            Specifying the directory that will store the
            downloaded data or the directory that
            already stores the input data.
            Default: ~/.dgl/
        save_dir : str
            Directory to save the processed dataset.
            Default: the value of `raw_dir`
        force_reload : bool
            Whether to reload the dataset. Default: False
        verbose : bool
            Whether to print out progress information
        """
        def __init__(self,
                     url=None,
                     raw_dir=None,
                     save_dir=None,
                     force_reload=False,
                     verbose=False):
            super(MyDataset, self).__init__(name='dataset_name',
                                            url=url,
                                            raw_dir=raw_dir,
                                            save_dir=save_dir,
                                            force_reload=force_reload,
                                            verbose=verbose)

        def download(self):
            # download raw data to local disk
            pass

        def process(self):
            # process raw data to graphs, labels, splitting masks
            pass

        def __getitem__(self, idx):
            # get one example by index
            pass

        def __len__(self):
            # number of data examples
            pass

        def save(self):
            # save processed data to directory `self.save_path`
            pass

        def load(self):
            # load processed data from directory `self.save_path`
            pass

        def has_cache(self):
            # check whether there are processed data in `self.save_path`
            pass

The :class:`~dgl.data.DGLDataset` class has abstract functions ``process()``,
``__getitem__(idx)`` and ``__len__()`` that must be implemented in the
subclass. DGL also recommends implementing saving and loading,
since they can save significant processing time on large datasets, and
several APIs make this easy (see :ref:`guide-data-pipeline-savenload`).

Note that the purpose of :class:`~dgl.data.DGLDataset` is to provide a standard and
convenient way to load graph data. One can store graphs, features,
labels, masks, and basic information about the dataset, such as the number of
classes and the number of labels. Operations such as sampling, partitioning,
and feature normalization are done outside of the :class:`~dgl.data.DGLDataset`
subclass.
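
As a hedged illustration of what ``process()`` might store, the sketch below
builds a single graph with node features, labels, and a training mask. The
names ``edges_src``, ``edges_dst``, ``node_feats``, and ``node_labels`` are
hypothetical placeholders for arrays parsed from the raw files, and the
80/20 split is purely illustrative.

.. code::

    import dgl
    import torch

    def process(self):
        # `edges_src`, `edges_dst`, `node_feats` and `node_labels` are
        # assumed to have been parsed from the raw files (hypothetical names)
        self.graph = dgl.graph((edges_src, edges_dst))
        self.graph.ndata['feat'] = node_feats
        self.graph.ndata['label'] = node_labels
        # mark the first 80% of nodes as the training split (illustrative only)
        n = self.graph.num_nodes()
        train_mask = torch.zeros(n, dtype=torch.bool)
        train_mask[:int(n * 0.8)] = True
        self.graph.ndata['train_mask'] = train_mask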

The rest of this chapter shows best practices for implementing the
functions in the pipeline.
56 changes: 56 additions & 0 deletions docs/source/guide/data-download.rst
@@ -0,0 +1,56 @@
.. _guide-data-pipeline-download:

4.2 Download raw data (optional)
--------------------------------

If a dataset is already on local disk, make sure it is in the directory
``raw_dir``. If one wants to run the code anywhere without bothering to
download and move data to the right directory, one can do so
automatically by implementing the function ``download()``.

If the dataset is a zip file, make ``MyDataset`` inherit from the
:class:`dgl.data.DGLBuiltinDataset` class, which handles the zip file extraction. Otherwise,
one needs to implement ``download()``, as in :class:`~dgl.data.QM7bDataset`:

.. code::

    import os
    from dgl.data.utils import download

    def download(self):
        # path to store the file
        file_path = os.path.join(self.raw_dir, self.name + '.mat')
        # download file
        download(self.url, path=file_path)

The above code downloads a .mat file to the directory ``self.raw_dir``. If
the file is a .gz, .tar, .tar.gz, or .tgz file, use the :func:`~dgl.data.utils.extract_archive`
function to extract it. The following code shows how to download a .gz file
in :class:`~dgl.data.BitcoinOTCDataset`:

.. code::

    from dgl.data.utils import download, check_sha1

    def download(self):
        # path to store the file
        # make sure to use the same suffix as the original file name's
        gz_file_path = os.path.join(self.raw_dir, self.name + '.csv.gz')
        # download file
        download(self.url, path=gz_file_path)
        # check SHA-1
        if not check_sha1(gz_file_path, self._sha1_str):
            raise UserWarning('File {} is downloaded but the content hash does not match. '
                              'The repo may be outdated or download may be incomplete. '
                              'Otherwise you can create an issue for it.'.format(self.name + '.csv.gz'))
        # extract file to directory `self.name` under `self.raw_dir`
        self._extract_gz(gz_file_path, self.raw_path)

The above code extracts the file into the directory ``self.name`` under
``self.raw_dir``. If the class inherits from :class:`dgl.data.DGLBuiltinDataset`
to handle a zip file, it will likewise extract the file into the directory
``self.name``.

Optionally, one can check the SHA-1 string of the downloaded file, as the
example above does, in case the author changes the file on the remote
server some day.
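
For other archive formats, a minimal sketch using :func:`~dgl.data.utils.extract_archive`
could look like the following; the ``.tgz`` suffix is an assumption for
illustration, not taken from a specific DGL dataset.

.. code::

    import os
    from dgl.data.utils import download, extract_archive

    def download(self):
        # hypothetical archive name; adjust the suffix to match the remote file
        tgz_file_path = os.path.join(self.raw_dir, self.name + '.tgz')
        download(self.url, path=tgz_file_path)
        # extract the archive into directory `self.raw_path`
        extract_archive(tgz_file_path, self.raw_path)
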
77 changes: 77 additions & 0 deletions docs/source/guide/data-loadogb.rst
@@ -0,0 +1,77 @@
.. _guide-data-pipeline-loadogb:

4.5 Loading OGB datasets using ``ogb`` package
----------------------------------------------

`Open Graph Benchmark (OGB) <https://ogb.stanford.edu/docs/home/>`__ is
a collection of benchmark datasets. The official OGB package
`ogb <https://github.com/snap-stanford/ogb>`__ provides APIs for
downloading and processing OGB datasets into :class:`dgl.data.DGLGraph` objects. This section
introduces their basic usage.

First, install the ogb package using pip:

.. code::

    pip install ogb

The following code shows how to load datasets for *Graph Property
Prediction* tasks.

.. code::

    # Load Graph Property Prediction datasets in OGB
    import dgl
    import torch
    from ogb.graphproppred import DglGraphPropPredDataset
    from torch.utils.data import DataLoader

    def _collate_fn(batch):
        # batch is a list of tuples (graph, label)
        graphs = [e[0] for e in batch]
        g = dgl.batch(graphs)
        labels = [e[1] for e in batch]
        labels = torch.stack(labels, 0)
        return g, labels

    # load dataset
    dataset = DglGraphPropPredDataset(name='ogbg-molhiv')
    split_idx = dataset.get_idx_split()
    # dataloaders
    train_loader = DataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True, collate_fn=_collate_fn)
    valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=32, shuffle=False, collate_fn=_collate_fn)
    test_loader = DataLoader(dataset[split_idx["test"]], batch_size=32, shuffle=False, collate_fn=_collate_fn)

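As a hedged sketch of consuming these loaders, the loop below iterates over
mini-batches; the ``'feat'`` key follows the convention the ogb package uses
for node features, and the model is a hypothetical placeholder.

.. code::

    # iterate over mini-batches of batched graphs and their labels
    for batched_graph, labels in train_loader:
        feats = batched_graph.ndata['feat']
        # a user-defined model would be applied here, e.g.
        # logits = model(batched_graph, feats)
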
Loading *Node Property Prediction* datasets is similar, but note that
there is only one graph object in this kind of dataset.

.. code::

    # Load Node Property Prediction datasets in OGB
    from ogb.nodeproppred import DglNodePropPredDataset

    dataset = DglNodePropPredDataset(name='ogbn-proteins')
    split_idx = dataset.get_idx_split()

    # there is only one graph in Node Property Prediction datasets
    g, labels = dataset[0]

    # get split labels
    train_label = dataset.labels[split_idx['train']]
    valid_label = dataset.labels[split_idx['valid']]
    test_label = dataset.labels[split_idx['test']]

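If boolean node masks are more convenient than index tensors, a minimal
sketch (assuming the ``g`` and ``split_idx`` above) is:

.. code::

    import torch

    # convert the train index tensor into a boolean node mask on `g`
    train_mask = torch.zeros(g.num_nodes(), dtype=torch.bool)
    train_mask[split_idx['train']] = True
    g.ndata['train_mask'] = train_mask
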
*Link Property Prediction* datasets also contain one graph per dataset:

.. code::

    # Load Link Property Prediction datasets in OGB
    from ogb.linkproppred import DglLinkPropPredDataset

    dataset = DglLinkPropPredDataset(name='ogbl-ppa')
    split_edge = dataset.get_edge_split()

    graph = dataset[0]
    print(split_edge['train'].keys())
    print(split_edge['valid'].keys())
    print(split_edge['test'].keys())
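
The exact keys vary across OGB link datasets, so inspect the printed keys
above first. As a hedged sketch, assuming the positive edges are stored
under an ``'edge'`` key as an ``(num_edges, 2)`` tensor (as in ``ogbl-ppa``):

.. code::

    # positive training edges as source/destination node columns
    train_pos = split_edge['train']['edge']
    src, dst = train_pos[:, 0], train_pos[:, 1]
    print(src.shape, dst.shape)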