Master refactor split chapter2n3 (dmlc#2215)
* [Feature] Add full graph training with dgl built-in dataset.

* [Bug] fix model to cuda.

* [Feature] Add test loss and accuracy

* [Fix] Add random

* [Bug] Fix batch norm error

* [Doc] Test with CN in Sphinx

* [Doc] Remove the test CN docs.

* [Feature] Add input embedding layer

* [Doc] fill readme with new performance results

* [Doc] Add Chinese User Guide, graph and 1.5

* [Doc] Refactor and split chapter 4

* [Fix] Remove CompGCN example codes

* [Doc] Add chapter 2 refactor and split

* [Fix] code format of savenload

* [Doc] Split chapter 3

* [Doc] Add introduction phrase of chapter 2

* [Doc] Add introduction phrase of chapter 3

* Fix

* Update chapter 2

* Update chapter 3

* Update chapter 4

Co-authored-by: mufeili <[email protected]>
zhjwy9343 and mufeili authored Sep 20, 2020
1 parent 49e9697 commit 90d86fc
Showing 16 changed files with 1,294 additions and 1,202 deletions.
100 changes: 100 additions & 0 deletions docs/source/guide/data-dataset.rst
@@ -0,0 +1,100 @@
.. _guide-data-pipeline-dataset:

4.1 DGLDataset class
--------------------

:class:`~dgl.data.DGLDataset` is the base class for processing, loading and saving
graph datasets defined in :ref:`apidata`. It implements the basic pipeline
for processing graph data. The following flow chart shows how the
pipeline works.

.. figure:: https://data.dgl.ai/asset/image/userguide_data_flow.png
    :align: center

    Flow chart for graph data input pipeline defined in class DGLDataset.

To process a graph dataset located on a remote server or local disk, one can
define a class, say ``MyDataset``, inheriting from :class:`dgl.data.DGLDataset`. The
template of ``MyDataset`` is as follows.

.. code::

    from dgl.data import DGLDataset

    class MyDataset(DGLDataset):
        """ Template for customizing graph datasets in DGL.

        Parameters
        ----------
        url : str
            URL to download the raw dataset
        raw_dir : str
            Specifying the directory that will store the
            downloaded data or the directory that
            already stores the input data.
            Default: ~/.dgl/
        save_dir : str
            Directory to save the processed dataset.
            Default: the value of `raw_dir`
        force_reload : bool
            Whether to reload the dataset. Default: False
        verbose : bool
            Whether to print out progress information
        """
        def __init__(self,
                     url=None,
                     raw_dir=None,
                     save_dir=None,
                     force_reload=False,
                     verbose=False):
            super(MyDataset, self).__init__(name='dataset_name',
                                            url=url,
                                            raw_dir=raw_dir,
                                            save_dir=save_dir,
                                            force_reload=force_reload,
                                            verbose=verbose)

        def download(self):
            # download raw data to local disk
            pass

        def process(self):
            # process raw data to graphs, labels, splitting masks
            pass

        def __getitem__(self, idx):
            # get one example by index
            pass

        def __len__(self):
            # number of data examples
            pass

        def save(self):
            # save processed data to directory `self.save_path`
            pass

        def load(self):
            # load processed data from directory `self.save_path`
            pass

        def has_cache(self):
            # check whether there are processed data in `self.save_path`
            pass

The :class:`~dgl.data.DGLDataset` class has abstract functions ``process()``,
``__getitem__(idx)`` and ``__len__()`` that must be implemented in the
subclass. DGL also recommends implementing saving and loading,
since they can save significant processing time on large datasets, and
several APIs make this easy (see :ref:`guide-data-pipeline-savenload`).

Note that the purpose of :class:`~dgl.data.DGLDataset` is to provide a standard and
convenient way to load graph data. One can store graphs, features,
labels, masks, and basic information about the dataset, such as the number of
classes and the number of labels. Operations such as sampling, partitioning,
and feature normalization are done outside of the :class:`~dgl.data.DGLDataset`
subclass.
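
As a hedged illustration of what ``process()`` might store, the sketch below
builds a single graph with node features, labels, and a training mask. The
names ``edges_src``, ``edges_dst``, ``node_feats``, and ``node_labels`` are
hypothetical placeholders for arrays parsed from the raw files, and the
80/20 split is purely illustrative.

.. code::

    import dgl
    import torch

    def process(self):
        # `edges_src`, `edges_dst`, `node_feats` and `node_labels` are
        # assumed to have been parsed from the raw files (hypothetical names)
        self.graph = dgl.graph((edges_src, edges_dst))
        self.graph.ndata['feat'] = node_feats
        self.graph.ndata['label'] = node_labels
        # mark the first 80% of nodes as the training split (illustrative only)
        n = self.graph.num_nodes()
        train_mask = torch.zeros(n, dtype=torch.bool)
        train_mask[:int(n * 0.8)] = True
        self.graph.ndata['train_mask'] = train_mask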

The rest of this chapter shows best practices for implementing the
functions in the pipeline.
56 changes: 56 additions & 0 deletions docs/source/guide/data-download.rst
@@ -0,0 +1,56 @@
.. _guide-data-pipeline-download:

4.2 Download raw data (optional)
--------------------------------

If a dataset is already on local disk, make sure it is in the directory
``raw_dir``. If one wants to run the code anywhere without bothering to
download and move data to the right directory, one can do so
automatically by implementing the function ``download()``.

If the dataset is a zip file, make ``MyDataset`` inherit from the
:class:`dgl.data.DGLBuiltinDataset` class, which handles the zip file extraction. Otherwise,
one needs to implement ``download()``, as in :class:`~dgl.data.QM7bDataset`:

.. code::

    import os
    from dgl.data.utils import download

    def download(self):
        # path to store the file
        file_path = os.path.join(self.raw_dir, self.name + '.mat')
        # download file
        download(self.url, path=file_path)

The above code downloads a .mat file to the directory ``self.raw_dir``. If
the file is a .gz, .tar, .tar.gz, or .tgz file, use the :func:`~dgl.data.utils.extract_archive`
function to extract it. The following code shows how to download a .gz file
in :class:`~dgl.data.BitcoinOTCDataset`:

.. code::

    from dgl.data.utils import download, check_sha1

    def download(self):
        # path to store the file
        # make sure to use the same suffix as the original file name's
        gz_file_path = os.path.join(self.raw_dir, self.name + '.csv.gz')
        # download file
        download(self.url, path=gz_file_path)
        # check SHA-1
        if not check_sha1(gz_file_path, self._sha1_str):
            raise UserWarning('File {} is downloaded but the content hash does not match. '
                              'The repo may be outdated or download may be incomplete. '
                              'Otherwise you can create an issue for it.'.format(self.name + '.csv.gz'))
        # extract file to directory `self.name` under `self.raw_dir`
        self._extract_gz(gz_file_path, self.raw_path)

The above code extracts the file into the directory ``self.name`` under
``self.raw_dir``. If the class inherits from :class:`dgl.data.DGLBuiltinDataset`
to handle a zip file, it will likewise extract the file into the directory
``self.name``.

Optionally, one can check the SHA-1 string of the downloaded file, as the
example above does, in case the author changes the file on the remote
server some day.
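
For other archive formats, a minimal sketch using :func:`~dgl.data.utils.extract_archive`
could look like the following; the ``.tgz`` suffix is an assumption for
illustration, not taken from a specific DGL dataset.

.. code::

    import os
    from dgl.data.utils import download, extract_archive

    def download(self):
        # hypothetical archive name; adjust the suffix to match the remote file
        tgz_file_path = os.path.join(self.raw_dir, self.name + '.tgz')
        download(self.url, path=tgz_file_path)
        # extract the archive into directory `self.raw_path`
        extract_archive(tgz_file_path, self.raw_path)
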
77 changes: 77 additions & 0 deletions docs/source/guide/data-loadogb.rst
@@ -0,0 +1,77 @@
.. _guide-data-pipeline-loadogb:

4.5 Loading OGB datasets using ``ogb`` package
----------------------------------------------

`Open Graph Benchmark (OGB) <https://ogb.stanford.edu/docs/home/>`__ is
a collection of benchmark datasets. The official OGB package
`ogb <https://github.com/snap-stanford/ogb>`__ provides APIs for
downloading and processing OGB datasets into :class:`dgl.data.DGLGraph` objects. This section
introduces their basic usage.

First, install the ogb package using pip:

.. code::

    pip install ogb

The following code shows how to load datasets for *Graph Property
Prediction* tasks.

.. code::

    # Load Graph Property Prediction datasets in OGB
    import dgl
    import torch
    from ogb.graphproppred import DglGraphPropPredDataset
    from torch.utils.data import DataLoader

    def _collate_fn(batch):
        # batch is a list of tuples (graph, label)
        graphs = [e[0] for e in batch]
        g = dgl.batch(graphs)
        labels = [e[1] for e in batch]
        labels = torch.stack(labels, 0)
        return g, labels

    # load dataset
    dataset = DglGraphPropPredDataset(name='ogbg-molhiv')
    split_idx = dataset.get_idx_split()
    # dataloaders
    train_loader = DataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True, collate_fn=_collate_fn)
    valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=32, shuffle=False, collate_fn=_collate_fn)
    test_loader = DataLoader(dataset[split_idx["test"]], batch_size=32, shuffle=False, collate_fn=_collate_fn)

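As a hedged sketch of consuming these loaders, the loop below iterates over
mini-batches; the ``'feat'`` key follows the convention the ogb package uses
for node features, and the model is a hypothetical placeholder.

.. code::

    # iterate over mini-batches of batched graphs and their labels
    for batched_graph, labels in train_loader:
        feats = batched_graph.ndata['feat']
        # a user-defined model would be applied here, e.g.
        # logits = model(batched_graph, feats)
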
Loading *Node Property Prediction* datasets is similar, but note that
there is only one graph object in this kind of dataset.

.. code::

    # Load Node Property Prediction datasets in OGB
    from ogb.nodeproppred import DglNodePropPredDataset

    dataset = DglNodePropPredDataset(name='ogbn-proteins')
    split_idx = dataset.get_idx_split()

    # there is only one graph in Node Property Prediction datasets
    g, labels = dataset[0]

    # get split labels
    train_label = dataset.labels[split_idx['train']]
    valid_label = dataset.labels[split_idx['valid']]
    test_label = dataset.labels[split_idx['test']]

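If boolean node masks are more convenient than index tensors, a minimal
sketch (assuming the ``g`` and ``split_idx`` above) is:

.. code::

    import torch

    # convert the train index tensor into a boolean node mask on `g`
    train_mask = torch.zeros(g.num_nodes(), dtype=torch.bool)
    train_mask[split_idx['train']] = True
    g.ndata['train_mask'] = train_mask
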
*Link Property Prediction* datasets also contain one graph per dataset:

.. code::

    # Load Link Property Prediction datasets in OGB
    from ogb.linkproppred import DglLinkPropPredDataset

    dataset = DglLinkPropPredDataset(name='ogbl-ppa')
    split_edge = dataset.get_edge_split()

    graph = dataset[0]
    print(split_edge['train'].keys())
    print(split_edge['valid'].keys())
    print(split_edge['test'].keys())
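
The exact keys vary across OGB link datasets, so inspect the printed keys
above first. As a hedged sketch, assuming the positive edges are stored
under an ``'edge'`` key as an ``(num_edges, 2)`` tensor (as in ``ogbl-ppa``):

.. code::

    # positive training edges as source/destination node columns
    train_pos = split_edge['train']['edge']
    src, dst = train_pos[:, 0], train_pos[:, 1]
    print(src.shape, dst.shape)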