Master refactor split chapter2n3 (dmlc#2215)
* [Feature] Add full graph training with dgl built-in dataset.
* [Bug] fix model to cuda.
* [Feature] Add test loss and accuracy
* [Fix] Add random
* [Bug] Fix batch norm error
* [Doc] Test with CN in Sphinx
* [Doc] Remove the test CN docs.
* [Feature] Add input embedding layer
* [Doc] fill readme with new performance results
* [Doc] Add Chinese User Guide, graph and 1.5
* [Doc] Refactor and split chapter 4
* [Fix] Remove CompGCN example codes
* [Doc] Add chapter 2 refactor and split
* [Fix] code format of savenload
* [Doc] Split chapter 3
* [Doc] Add introduction phrase of chapter 2
* [Doc] Add introduction phrase of chapter 3
* Fix
* Update chapter 2
* Update chapter 3
* Update chapter 4

Co-authored-by: mufeili <[email protected]>
Showing 16 changed files with 1,294 additions and 1,202 deletions.
@@ -0,0 +1,100 @@

.. _guide-data-pipeline-dataset:

4.1 DGLDataset class
--------------------

:class:`~dgl.data.DGLDataset` is the base class for processing, loading and saving
graph datasets defined in :ref:`apidata`. It implements the basic pipeline
for processing graph data. The following flow chart shows how the
pipeline works.

.. figure:: https://data.dgl.ai/asset/image/userguide_data_flow.png
   :align: center

   Flow chart for graph data input pipeline defined in class DGLDataset.

To process a graph dataset located on a remote server or local disk, one can
define a class, say ``MyDataset``, inheriting from :class:`dgl.data.DGLDataset`. The
template of ``MyDataset`` is as follows.
.. code::

    from dgl.data import DGLDataset

    class MyDataset(DGLDataset):
        """ Template for customizing graph datasets in DGL.

        Parameters
        ----------
        url : str
            URL to download the raw dataset
        raw_dir : str
            Specifying the directory that will store the
            downloaded data or the directory that
            already stores the input data.
            Default: ~/.dgl/
        save_dir : str
            Directory to save the processed dataset.
            Default: the value of `raw_dir`
        force_reload : bool
            Whether to reload the dataset. Default: False
        verbose : bool
            Whether to print out progress information
        """
        def __init__(self,
                     url=None,
                     raw_dir=None,
                     save_dir=None,
                     force_reload=False,
                     verbose=False):
            super(MyDataset, self).__init__(name='dataset_name',
                                            url=url,
                                            raw_dir=raw_dir,
                                            save_dir=save_dir,
                                            force_reload=force_reload,
                                            verbose=verbose)

        def download(self):
            # download raw data to local disk
            pass

        def process(self):
            # process raw data to graphs, labels, splitting masks
            pass

        def __getitem__(self, idx):
            # get one example by index
            pass

        def __len__(self):
            # number of data examples
            pass

        def save(self):
            # save processed data to directory `self.save_path`
            pass

        def load(self):
            # load processed data from directory `self.save_path`
            pass

        def has_cache(self):
            # check whether there are processed data in `self.save_path`
            pass
The :class:`~dgl.data.DGLDataset` class has abstract functions ``process()``,
``__getitem__(idx)`` and ``__len__()`` that must be implemented in the
subclass. DGL also recommends implementing saving and loading,
since they can save significant time when processing large datasets, and
several APIs make this easy (see :ref:`guide-data-pipeline-savenload`).
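
For concreteness, here is a minimal sketch of a working subclass that builds a
small synthetic dataset in ``process()`` and implements the three required
methods. The dataset name, graph sizes and feature names are illustrative
assumptions, not part of DGL.

.. code::

    import dgl
    import torch
    from dgl.data import DGLDataset

    class SyntheticDataset(DGLDataset):
        """A toy dataset of random graphs (hypothetical example)."""
        def __init__(self):
            super(SyntheticDataset, self).__init__(name='synthetic')

        def process(self):
            # build ten random graphs with node features and binary graph labels
            self.graphs = []
            self.labels = []
            for _ in range(10):
                g = dgl.rand_graph(5, 8)  # 5 nodes, 8 edges
                g.ndata['feat'] = torch.randn(5, 4)
                self.graphs.append(g)
                self.labels.append(torch.randint(0, 2, (1,)))

        def __getitem__(self, idx):
            # get one example by index
            return self.graphs[idx], self.labels[idx]

        def __len__(self):
            # number of data examples
            return len(self.graphs)

    dataset = SyntheticDataset()
    g, label = dataset[0]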
Note that the purpose of :class:`~dgl.data.DGLDataset` is to provide a standard and
convenient way to load graph data. One can store graphs, features,
labels, masks, and basic information about the dataset, such as the number of
classes and the number of labels. Operations such as sampling, partitioning
or feature normalization are done outside of the :class:`~dgl.data.DGLDataset`
subclass.
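
As a sketch of what "outside the subclass" means, assuming the graphs expose a
``'feat'`` node feature as in the toy dataset above, feature normalization can
be applied after loading:

.. code::

    # hypothetical usage: standardize features after loading, not inside the dataset
    dataset = SyntheticDataset()
    g, label = dataset[0]
    feat = g.ndata['feat']
    # per-dimension standardization; the epsilon guards against zero variance
    g.ndata['feat'] = (feat - feat.mean(dim=0)) / (feat.std(dim=0) + 1e-6)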
The rest of this chapter shows the best practices to implement the
functions in the pipeline.
@@ -0,0 +1,56 @@
.. _guide-data-pipeline-download:

4.2 Download raw data (optional)
--------------------------------

If a dataset is already on local disk, make sure it is in the directory
``raw_dir``. If one wants to run the code anywhere without bothering to
download and move data to the right directory, one can do so
automatically by implementing the function ``download()``.

If the dataset is a zip file, make ``MyDataset`` inherit from the
:class:`dgl.data.DGLBuiltinDataset` class, which handles zip file extraction.
Otherwise, implement ``download()`` as in :class:`~dgl.data.QM7bDataset`:
.. code::

    import os
    from dgl.data.utils import download

    def download(self):
        # path to store the file
        file_path = os.path.join(self.raw_dir, self.name + '.mat')
        # download file
        download(self.url, path=file_path)
The above code downloads a .mat file to the directory ``self.raw_dir``. If
the file is a .gz, .tar, .tar.gz or .tgz file, use the :func:`~dgl.data.utils.extract_archive`
function to extract it. The following code shows how to download a .gz file
in :class:`~dgl.data.BitcoinOTCDataset`:
.. code::

    import os
    from dgl.data.utils import download, check_sha1

    def download(self):
        # path to store the file
        # make sure to use the same suffix as the original file name
        gz_file_path = os.path.join(self.raw_dir, self.name + '.csv.gz')
        # download file
        download(self.url, path=gz_file_path)
        # check SHA-1
        if not check_sha1(gz_file_path, self._sha1_str):
            raise UserWarning('File {} is downloaded but the content hash does not match. '
                              'The repo may be outdated or download may be incomplete. '
                              'Otherwise you can create an issue for it.'.format(self.name + '.csv.gz'))
        # extract file to directory `self.name` under `self.raw_dir`
        self._extract_gz(gz_file_path, self.raw_path)
The above code extracts the file into the directory ``self.name`` under
``self.raw_dir``. If the class inherits from :class:`dgl.data.DGLBuiltinDataset`
to handle zip files, it extracts the file into the directory ``self.name``
as well.
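
For archives other than plain ``.gz`` files, one can let
:func:`~dgl.data.utils.extract_archive` do the unpacking. Below is a hedged
sketch for a hypothetical dataset distributed as a zip archive; the ``.zip``
suffix is an assumption for illustration.

.. code::

    import os
    from dgl.data.utils import download, extract_archive

    def download(self):
        # hypothetical example: the remote file is a zip archive
        zip_file_path = os.path.join(self.raw_dir, self.name + '.zip')
        download(self.url, path=zip_file_path)
        # unpack the archive into directory `self.raw_path`;
        # extract_archive recognizes common formats such as .zip, .tar(.gz) and .gz
        extract_archive(zip_file_path, self.raw_path)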
Optionally, one can check the SHA-1 string of the downloaded file as the
:class:`~dgl.data.BitcoinOTCDataset` example above does, in case the author
changed the file on the remote server some day.
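
To obtain the reference hash to hardcode (``self._sha1_str`` in the example
above), one can compute it once with Python's standard library. This is a
generic sketch, not a DGL API:

.. code::

    import hashlib

    def sha1_of_file(path, chunk_size=1 << 20):
        # stream the file in 1 MiB chunks to avoid loading it all into memory
        sha1 = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                sha1.update(chunk)
        return sha1.hexdigest()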
@@ -0,0 +1,77 @@
.. _guide-data-pipeline-loadogb:

4.5 Loading OGB datasets using ``ogb`` package
----------------------------------------------

`Open Graph Benchmark (OGB) <https://ogb.stanford.edu/docs/home/>`__ is
a collection of benchmark datasets. The official OGB package
`ogb <https://github.com/snap-stanford/ogb>`__ provides APIs for
downloading and processing OGB datasets into :class:`dgl.data.DGLGraph` objects.
This section introduces their basic usage.

First, install the ogb package using pip:
.. code::

    pip install ogb

The following code shows how to load datasets for *Graph Property
Prediction* tasks.
.. code::

    # Load Graph Property Prediction datasets in OGB
    import dgl
    import torch
    from ogb.graphproppred import DglGraphPropPredDataset
    from torch.utils.data import DataLoader

    def _collate_fn(batch):
        # batch is a list of tuples (graph, label)
        graphs = [e[0] for e in batch]
        g = dgl.batch(graphs)
        labels = [e[1] for e in batch]
        labels = torch.stack(labels, 0)
        return g, labels

    # load dataset
    dataset = DglGraphPropPredDataset(name='ogbg-molhiv')
    split_idx = dataset.get_idx_split()
    # dataloader
    train_loader = DataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True, collate_fn=_collate_fn)
    valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=32, shuffle=False, collate_fn=_collate_fn)
    test_loader = DataLoader(dataset[split_idx["test"]], batch_size=32, shuffle=False, collate_fn=_collate_fn)
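
As a quick sanity check of the resulting loaders, one can pull a single
mini-batch; this is an illustrative sketch, and the printed label shape
depends on the dataset (``(batch_size, num_tasks)``).

.. code::

    # iterate one mini-batch to verify the pipeline
    batched_graph, labels = next(iter(train_loader))
    print(batched_graph.num_nodes(), batched_graph.num_edges())
    print(labels.shape)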
Loading *Node Property Prediction* datasets is similar, but note that
there is only one graph object in this kind of dataset.
.. code::

    # Load Node Property Prediction datasets in OGB
    from ogb.nodeproppred import DglNodePropPredDataset

    dataset = DglNodePropPredDataset(name='ogbn-proteins')
    split_idx = dataset.get_idx_split()

    # there is only one graph in Node Property Prediction datasets
    g, labels = dataset[0]

    # get split labels
    train_label = dataset.labels[split_idx['train']]
    valid_label = dataset.labels[split_idx['valid']]
    test_label = dataset.labels[split_idx['test']]
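
A common follow-up, shown here as a hedged sketch, is to attach the labels and
boolean split masks to the graph as node data so that they travel with the
graph; the ``'label'`` and ``'*_mask'`` field names are conventions, not
requirements.

.. code::

    import torch

    # hypothetical post-processing: store labels and split masks on the graph
    g.ndata['label'] = labels
    for split in ['train', 'valid', 'test']:
        mask = torch.zeros(g.num_nodes(), dtype=torch.bool)
        mask[split_idx[split]] = True
        g.ndata[split + '_mask'] = mask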
*Link Property Prediction* datasets also contain one graph per dataset:
.. code::

    # Load Link Property Prediction datasets in OGB
    from ogb.linkproppred import DglLinkPropPredDataset

    dataset = DglLinkPropPredDataset(name='ogbl-ppa')
    split_edge = dataset.get_edge_split()

    graph = dataset[0]
    print(split_edge['train'].keys())
    print(split_edge['valid'].keys())
    print(split_edge['test'].keys())
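
The printed keys reveal how the edge splits are organized. As a hedged
illustration, assuming the common OGB layout where positive edges are stored
under the ``'edge'`` key as a ``(num_edges, 2)`` tensor:

.. code::

    # hypothetical inspection of the training split; verify the keys printed above
    train_pos = split_edge['train']['edge']
    src, dst = train_pos[:, 0], train_pos[:, 1]
    print(train_pos.shape)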