user guide korean version (dmlc#3559)

Co-authored-by: zhjwy9343 <[email protected]>
ammuaj · Dec 2, 2021 · 67f9314 · 67f9314
1 parent d2ef243
commit 67f9314
Showing 44 changed files with 4,954 additions and 0 deletions.
diff --git a/docs/source/guide_ko/data-dataset.rst b/docs/source/guide_ko/data-dataset.rst
@@ -0,0 +1,88 @@
+.. _guide_ko-data-pipeline-dataset:
+
+4.1 DGLDataset class
+--------------------
+
+:ref:`(English Version) <guide-data-pipeline-dataset>`
+
+:class:`~dgl.data.DGLDataset` is the base class for processing, loading and saving the graph datasets defined in :ref:`apidata`. It implements the basic pipeline for processing graph data. The flowchart below shows how the pipeline works.
+
+.. figure:: https://data.dgl.ai/asset/image/userguide_data_flow.png
+    :align: center
+
+    Flowchart of the graph data input pipeline defined in the DGLDataset class.
+
+
+To process a graph dataset located on a remote server or on local disk, define a class that inherits from :class:`dgl.data.DGLDataset`; call it ``MyDataset``. The ``MyDataset`` template is as follows.
+
+.. code:: 
+
+    from dgl.data import DGLDataset
+    
+    class MyDataset(DGLDataset):
+        """ Template for customizing graph datasets in DGL.
+    
+        Parameters
+        ----------
+        url : str
+            URL to download the raw dataset
+        raw_dir : str
+            Specifying the directory that will store the 
+            downloaded data or the directory that
+            already stores the input data.
+            Default: ~/.dgl/
+        save_dir : str
+            Directory to save the processed dataset.
+            Default: the value of `raw_dir`
+        force_reload : bool
+            Whether to reload the dataset. Default: False
+        verbose : bool
+            Whether to print out progress information
+        """
+        def __init__(self, 
+                     url=None, 
+                     raw_dir=None, 
+                     save_dir=None, 
+                     force_reload=False, 
+                     verbose=False):
+            super(MyDataset, self).__init__(name='dataset_name',
+                                            url=url,
+                                            raw_dir=raw_dir,
+                                            save_dir=save_dir,
+                                            force_reload=force_reload,
+                                            verbose=verbose)
+    
+        def download(self):
+            # download raw data to local disk
+            pass
+    
+        def process(self):
+            # process raw data to graphs, labels, splitting masks
+            pass
+        
+        def __getitem__(self, idx):
+            # get one example by index
+            pass
+    
+        def __len__(self):
+            # number of data examples
+            pass
+    
+        def save(self):
+            # save processed data to directory `self.save_path`
+            pass
+    
+        def load(self):
+            # load processed data from directory `self.save_path`
+            pass
+    
+        def has_cache(self):
+            # check whether there are processed data in `self.save_path`
+            pass
+
+Subclasses of :class:`~dgl.data.DGLDataset` must implement ``process()``, ``__getitem__(idx)`` and ``__len__()``. DGL also recommends implementing saving and loading, since doing so can save significant time when processing large datasets, and several APIs make it easy to implement (see :ref:`guide-data-pipeline-savenload`).
+
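+The purpose of :class:`~dgl.data.DGLDataset` is to provide a standard and convenient way to load graph data. A dataset can store graphs, features, labels, and basic information about the dataset, such as the number of classes or the number of labels. Operations such as sampling, partitioning or feature normalization are done outside of :class:`~dgl.data.DGLDataset` subclasses.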
+
+The rest of this chapter presents best practices for implementing the functions in the pipeline.
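
The pipeline that section 4.1 describes (reuse processed data when a cache exists, otherwise download, process and save) can be sketched in plain Python. This is an illustration only, not DGL's implementation: the class name `CachedDatasetSketch`, the `calls` list, the pickle-based cache and the exact ordering of the steps are assumptions made for the example.

```python
import os
import pickle
import tempfile


class CachedDatasetSketch:
    """Plain-Python illustration of a load-or-process pipeline.

    Hypothetical class, not part of DGL.
    """

    def __init__(self, name, save_dir, force_reload=False):
        self.name = name
        self.save_path = os.path.join(save_dir, name + ".pkl")
        self.force_reload = force_reload
        self.calls = []  # records which steps ran, for inspection
        self.data = None
        self._load()

    def _load(self):
        # Reuse the cache unless a reload is forced.
        if not self.force_reload and self.has_cache():
            self.load()
        else:
            self.download()
            self.process()
            self.save()

    def has_cache(self):
        return os.path.exists(self.save_path)

    def download(self):
        # A real dataset would fetch raw files here.
        self.calls.append("download")

    def process(self):
        self.calls.append("process")
        # Stand-in for real (graph, label) pairs.
        self.data = [(i, i % 2) for i in range(4)]

    def save(self):
        self.calls.append("save")
        with open(self.save_path, "wb") as f:
            pickle.dump(self.data, f)

    def load(self):
        self.calls.append("load")
        with open(self.save_path, "rb") as f:
            self.data = pickle.load(f)

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)


with tempfile.TemporaryDirectory() as tmp:
    first = CachedDatasetSketch("toy", tmp)   # no cache yet
    second = CachedDatasetSketch("toy", tmp)  # cache written by `first`
    first_calls, second_calls = first.calls, second.calls
    n_items, item = len(second), second[1]
```

The second instance skips `download()` and `process()` entirely, which is the time saving that motivates implementing saving and loading.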
diff --git a/docs/source/guide_ko/data-download.rst b/docs/source/guide_ko/data-download.rst
@@ -0,0 +1,45 @@
+.. _guide_ko-data-pipeline-download:
+
+4.2 Download raw data (optional)
+--------------------------------
+
+:ref:`(English Version) <guide-data-pipeline-download>`
+
+If the dataset is already on local disk, it must be in the ``raw_dir`` directory. If you want to run the code anywhere without manually downloading the data and moving it to the right directory, implement ``download()`` to automate this.
+
+If the dataset is a zip file, make ``MyDataset`` inherit from :class:`dgl.data.DGLBuiltinDataset`, which extracts zip files automatically. Otherwise, implement ``download()`` yourself, as :class:`~dgl.data.QM7bDataset` does:
+
+.. code:: 
+
+    import os
+    from dgl.data.utils import download
+    
+    def download(self):
+        # path to store the file
+        file_path = os.path.join(self.raw_dir, self.name + '.mat')
+        # download file
+        download(self.url, path=file_path)
+
+The code above downloads a .mat file to the ``self.raw_dir`` directory. If the file is a .gz, .tar, .tar.gz or .tgz file, use :func:`~dgl.data.utils.extract_archive` to extract it. The following code shows how :class:`~dgl.data.BitcoinOTCDataset` downloads a .gz file:
+
+.. code:: 
+
+    import os
+    from dgl.data.utils import download, check_sha1
+    
+    def download(self):
+        # path to store the file
+        # make sure to use the same suffix as the original file name's
+        gz_file_path = os.path.join(self.raw_dir, self.name + '.csv.gz')
+        # download file
+        download(self.url, path=gz_file_path)
+        # check SHA-1
+        if not check_sha1(gz_file_path, self._sha1_str):
+            raise UserWarning('File {} is downloaded but the content hash does not match. '
+                              'The repo may be outdated or download may be incomplete. '
+                              'Otherwise you can create an issue for it.'.format(self.name + '.csv.gz'))
+        # extract file to directory `self.name` under `self.raw_dir`
+        self._extract_gz(gz_file_path, self.raw_path)
+
+The code above extracts the file into the ``self.name`` subdirectory under ``self.raw_dir``. If you inherit from :class:`dgl.data.DGLBuiltinDataset` to handle zip files, the files are extracted into the ``self.name`` directory automatically.
+
+Optionally, you can verify the SHA-1 hash of the downloaded file, as in the example above, to detect whether the file has been changed.
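
The SHA-1 check and .gz extraction described in section 4.2 can be approximated with the Python standard library alone. This is a minimal sketch, not the code behind `check_sha1` or `_extract_gz`: the helper names `sha1_of_file` and `extract_gz`, the toy CSV content and the file names are assumptions made for the example.

```python
import gzip
import hashlib
import os
import shutil
import tempfile


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()


def extract_gz(gz_path, out_path):
    """Decompress a single .gz file to out_path."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a downloaded archive (hypothetical content).
    gz_path = os.path.join(tmp, "toy.csv.gz")
    with gzip.open(gz_path, "wb") as f:
        f.write(b"0,1,10\n1,2,-3\n")

    # In practice the expected hash is published alongside the dataset.
    expected = sha1_of_file(gz_path)
    hash_ok = sha1_of_file(gz_path) == expected

    # A modified copy must fail the check.
    bad_path = os.path.join(tmp, "tampered.csv.gz")
    shutil.copyfile(gz_path, bad_path)
    with open(bad_path, "ab") as f:
        f.write(b"\x00")
    tampered_ok = sha1_of_file(bad_path) == expected

    # Extract only after the hash has been verified.
    out_path = os.path.join(tmp, "toy.csv")
    extract_gz(gz_path, out_path)
    with open(out_path, "rb") as f:
        extracted = f.read()
```

Hashing in chunks keeps memory use flat for large archives, and checking the hash before extraction avoids unpacking an incomplete download.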
diff --git a/docs/source/guide_ko/data-loadogb.rst b/docs/source/guide_ko/data-loadogb.rst
@@ -0,0 +1,73 @@
+.. _guide_ko-data-pipeline-loadogb:
+
+4.5 Loading OGB datasets using the ``ogb`` package
+--------------------------------------------------
+
+:ref:`(English Version) <guide-data-pipeline-loadogb>`
+
+`Open Graph Benchmark (OGB) <https://ogb.stanford.edu/docs/home/>`__ is a collection of benchmark datasets. The official OGB package `ogb <https://github.com/snap-stanford/ogb>`__ provides APIs for downloading OGB datasets and processing them into :class:`dgl.DGLGraph` objects. This section describes the basic usage.
+
+First, install the ogb package with pip.
+
+.. code:: 
+
+    pip install ogb
+
+The following code shows how to load datasets for *Graph Property Prediction* tasks.
+
+.. code:: 
+
+    # Load Graph Property Prediction datasets in OGB
+    import dgl
+    import torch
+    from ogb.graphproppred import DglGraphPropPredDataset
+    from dgl.dataloading import GraphDataLoader
+    
+    
+    def _collate_fn(batch):
+        # batch is a list of tuple (graph, label)
+        graphs = [e[0] for e in batch]
+        g = dgl.batch(graphs)
+        labels = [e[1] for e in batch]
+        labels = torch.stack(labels, 0)
+        return g, labels
+    
+    # load dataset
+    dataset = DglGraphPropPredDataset(name='ogbg-molhiv')
+    split_idx = dataset.get_idx_split()
+    # dataloader
+    train_loader = GraphDataLoader(dataset[split_idx["train"]], batch_size=32, shuffle=True, collate_fn=_collate_fn)
+    valid_loader = GraphDataLoader(dataset[split_idx["valid"]], batch_size=32, shuffle=False, collate_fn=_collate_fn)
+    test_loader = GraphDataLoader(dataset[split_idx["test"]], batch_size=32, shuffle=False, collate_fn=_collate_fn)
+
+Loading *Node Property Prediction* datasets is similar, but note that this kind of dataset contains only one graph object.
+
+.. code:: 
+
+    # Load Node Property Prediction datasets in OGB
+    from ogb.nodeproppred import DglNodePropPredDataset
+    
+    dataset = DglNodePropPredDataset(name='ogbn-proteins')
+    split_idx = dataset.get_idx_split()
+    
+    # there is only one graph in Node Property Prediction datasets
+    g, labels = dataset[0]
+    # get split labels
+    train_label = dataset.labels[split_idx['train']]
+    valid_label = dataset.labels[split_idx['valid']]
+    test_label = dataset.labels[split_idx['test']]
+
+*Link Property Prediction* datasets also contain a single graph per dataset.
+
+.. code:: 
+
+    # Load Link Property Prediction datasets in OGB
+    from ogb.linkproppred import DglLinkPropPredDataset
+    
+    dataset = DglLinkPropPredDataset(name='ogbl-ppa')
+    split_edge = dataset.get_edge_split()
+    
+    graph = dataset[0]
+    print(split_edge['train'].keys())
+    print(split_edge['valid'].keys())
+    print(split_edge['test'].keys())