This repository implements a tool for generating graphs of arbitrary size, including node and edge tabular features.
Synthetic data generation has become pervasive with exploding amounts of data and the demand to deploy machine learning models that leverage such data. There has been increasing interest in applying graph-based neural network models to graph datasets, though many public datasets are of a much smaller scale than those used in real-world applications. Synthetic graph generation is a common problem in multiple domains and serves various applications, including generating big graphs with properties similar to the original and anonymizing data that cannot be shared. The Synthetic Graph Generation tool enables users to generate arbitrary graphs based on provided real data.
The tool has the following architecture.
The module is composed of three parts: a structural generator, which fits the graph structure; a feature generator, which fits the feature distribution contained in the graph; and an aligner, which aligns the generated features with the generated graph structure.
The graph structural generator fits the graph structure and generates a corresponding graph containing the nodes and edges.
The feature generator fits the feature distribution contained in the graph and generates the corresponding features. There is the option to allow users to generate features associated with nodes, edges, or both.
The aligner aligns the features produced by the feature generator with the graph structure produced by the graph structural generator.
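The three components can be pictured as a small pipeline. The sketch below is a toy illustration that uses only the Python standard library. The function names `generate_structure`, `generate_features`, and `align` are hypothetical and are not part of the tool's API, and the random pairing merely stands in for a fitted aligner.

```python
import random


def generate_structure(num_nodes, num_edges, rng):
    """Structural generator stand-in: sample random directed edges."""
    return [(rng.randrange(num_nodes), rng.randrange(num_nodes))
            for _ in range(num_edges)]


def generate_features(num_rows, rng):
    """Feature generator stand-in: one uniform 'amount' value per row."""
    return [{"amount": rng.uniform(0.0, 100.0)} for _ in range(num_rows)]


def align(edges, feature_rows, rng):
    """Aligner stand-in: attach feature rows to edges in random order."""
    shuffled = feature_rows[:]
    rng.shuffle(shuffled)
    return [{"src": src, "dst": dst, **row}
            for (src, dst), row in zip(edges, shuffled)]


rng = random.Random(0)
edges = generate_structure(num_nodes=5, num_edges=8, rng=rng)
features = generate_features(num_rows=len(edges), rng=rng)
graph = align(edges, features, rng)
print(len(graph), "edges, each with fields", sorted(graph[0]))
```

The point of the split is that structure and features are produced independently, so a separate step is needed to decide which feature row belongs to which edge or node.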
By default, the synthetic graph generation tool generates a random graph with random features specified by the user.
This tool supports the following features:
Feature | Synthetic Graph Generation |
---|---|
Non-partite graph generation | Yes |
Bipartite graph generation | Yes |
N-partite graph generation | No |
Undirected graph generation | Yes |
Directed graph generation | Yes |
Self-loops generation | Yes |
Edge features generation | Yes |
Node features generation | Yes |
- Non-partite graph generation is the task of generating a graph that doesn't contain any explicit partites (disjoint and independent sets of nodes).
- Bipartite graph generation is the task of generating a graph that consists of two partites.
- N-partite graph generation is the task of generating a graph that consists of an arbitrary number of partites.
- Undirected graph generation is the task of generating a graph made up of a set of vertices connected by unordered edges.
- Directed graph generation is the task of generating a graph made up of a set of vertices connected by directed edges.
- Self-loops generation is the task of generating edges that connect a vertex to itself.
- Edge features generation is the task of generating features associated with an edge.
- Node features generation is the task of generating features associated with a node.
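The graph properties defined above can be checked on a plain edge list. The following is a minimal sketch with hypothetical helper names (`has_self_loops`, `to_undirected`, `is_bipartite_split`); it assumes edges are stored as `(source, destination)` tuples and does not reflect how the tool represents graphs internally.

```python
def has_self_loops(edges):
    """True if any edge connects a vertex to itself."""
    return any(src == dst for src, dst in edges)


def to_undirected(edges):
    """Treat edges as unordered: (u, v) and (v, u) collapse into one pair."""
    return {tuple(sorted(edge)) for edge in edges}


def is_bipartite_split(edges, src_set, dst_set):
    """True if the two node sets are disjoint and every edge crosses between them."""
    if src_set & dst_set:
        return False
    return all(
        (u in src_set and v in dst_set) or (u in dst_set and v in src_set)
        for u, v in edges
    )


directed_edges = [(0, 1), (1, 0), (2, 2), (0, 3)]
print(has_self_loops(directed_edges))      # the edge (2, 2) is a self-loop
print(len(to_undirected(directed_edges)))  # (0, 1) and (1, 0) count once

users, merchants = {"u0", "u1"}, {"m0", "m1"}
bipartite_edges = [("u0", "m0"), ("u1", "m0"), ("m1", "u0")]
print(is_bipartite_split(bipartite_edges, users, merchants))
```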
Structural graph generation
- RMAT
- Random (Erdos-Renyi)
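To give an intuition for the two structural generators, the sketch below shows textbook versions of both ideas using only the standard library. It is not the tool's implementation: the function names are hypothetical, and the quadrant probabilities `(0.57, 0.19, 0.19, 0.05)` are illustrative values chosen for this example rather than parameters fitted from data.

```python
import random


def erdos_renyi_edges(num_nodes, num_edges, rng):
    """Uniform random graph: draw distinct directed pairs, excluding self-loops."""
    if num_edges > num_nodes * (num_nodes - 1):
        raise ValueError("too many edges requested for this number of nodes")
    edges = set()
    while len(edges) < num_edges:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v:
            edges.add((u, v))
    return sorted(edges)


def rmat_edges(scale, num_edges, rng, probs=(0.57, 0.19, 0.19, 0.05)):
    """R-MAT: for each edge, descend `scale` levels of the adjacency matrix,
    choosing one of four quadrants per level with probabilities `probs`."""
    edges = []
    for _ in range(num_edges):
        row = col = 0
        for _level in range(scale):
            quadrant = rng.choices(range(4), weights=probs)[0]
            row = 2 * row + (quadrant >> 1)  # quadrants 2 and 3: lower half
            col = 2 * col + (quadrant & 1)   # quadrants 1 and 3: right half
        edges.append((row, col))
    return edges  # may contain duplicates and self-loops


rng = random.Random(42)
er = erdos_renyi_edges(num_nodes=8, num_edges=10, rng=rng)
rmat = rmat_edges(scale=3, num_edges=10, rng=rng)
print(er)
print(rmat)
```

With unequal quadrant probabilities, R-MAT concentrates edges on a few low-numbered nodes, which produces the skewed degree distributions typical of real graphs, whereas the uniform model spreads edges evenly.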
Tabular features
- CTGAN (Conditional GAN)
- CTAB
- KDE
- Gaussian
- Random (Uniform)
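The simpler tabular generators in this list share a fit-then-sample pattern. The sketch below illustrates that pattern for a single continuous column with the standard library only. The function names and the `bandwidth` value are assumptions made for this example; the GAN-based generators are not covered, and the tool's own classes may work differently.

```python
import random
import statistics


def fit_gaussian(values):
    """Fit a Gaussian by estimating the mean and sample standard deviation."""
    return statistics.fmean(values), statistics.stdev(values)


def sample_gaussian(params, n, rng):
    mean, std = params
    return [rng.gauss(mean, std) for _ in range(n)]


def fit_uniform(values):
    """Fit a uniform distribution by recording the observed range."""
    return min(values), max(values)


def sample_uniform(params, n, rng):
    low, high = params
    return [rng.uniform(low, high) for _ in range(n)]


def sample_kde(values, n, rng, bandwidth):
    """Gaussian KDE sampling: pick an observed point, then add Gaussian noise."""
    return [rng.gauss(rng.choice(values), bandwidth) for _ in range(n)]


observed = [10.0, 12.0, 11.0, 13.0, 9.0]
rng = random.Random(7)
gaussian_params = fit_gaussian(observed)
uniform_params = fit_uniform(observed)
print(sample_gaussian(gaussian_params, 3, rng))
print(sample_uniform(uniform_params, 3, rng))
print(sample_kde(observed, 3, rng, bandwidth=0.5))
```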
Aligner
- XGBoost
- Random
The following section lists the requirements you need to run the Synthetic Graph Generation tool.
This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- NVIDIA Ampere Architecture, NVIDIA Volta or NVIDIA Turing based GPU
- NVIDIA Docker
- Custom Docker containers built for this tool. Refer to the steps in the Quick Start Guide.
For more information about how to get started with NGC containers, refer to the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation.
For those unable to set up the required environment or create your own container, refer to the versioned NVIDIA Container Support Matrix.
To use the tool, perform the following steps. For the specifics concerning generation and training, refer to the Advanced section.
1. Clone the repository.
git clone https://github.com/NVIDIA/DeepLearningExamples
2. Go to the SyntheticGraphGeneration tool directory within the DeepLearningExamples repository:
cd DeepLearningExamples/Tools/DGLPyTorch/SyntheticGraphGeneration
3. Build the SyntheticGraphGeneration container.
bash docker_scripts/build_docker.sh
4. Download the datasets. (It is advisable to run this command inside the Docker interactive container to ensure the environment is set up; see step 6.1.)
bash scripts/get_datasets.sh
Note: This script requires manually downloading 4 datasets (tabformer, ieee, paysim, credit) and placing them in the ./data directory with the correct naming. The instructions for the manual download are printed during preprocessing. If the raw data is not present or the dataset is already preprocessed, the preprocessing is skipped.
5. Run the SyntheticGraphGeneration Jupyter notebook.
5.1. Run the Docker notebook container.
bash docker_scripts/run_docker_notebook.sh
5.2. Open the Jupyter notebook.
http://localhost:9916/tree/demos
6. Run the SyntheticGraphGeneration CLI.
6.1. Run the Docker interactive container.
bash docker_scripts/run_docker_interactive.sh
6.2. Run Command Line Interface (CLI) command.
The tool contains two run commands: synthesize and pretrain.
For example, to synthesize a graph similar to the IEEE dataset, run the following command:
syngen synthesize \
--synthesizer static_bipartite_graph \
--preprocessing ieee \
--aligner xg_boost \
--graph-generator rmat_bipartite \
--gg-seed 42 \
--edge-generator ctgan \
--eg-batch-size 2000 \
--eg-epochs 10 \
--num-nodes-src-set 17091 \
--num-nodes-dst-set 198 \
--num-edges-src-dst 52008 \
--num-edges-dst-src 52008 \
--data-path '/workspace/data/ieee-fraud/data.csv' \
--save-path '/workspace/ieee/' \
--features-to-correlate-edge "{\"TransactionAmt\": \"continuous\"}"
In this example run, the fitted synthesizer generates a graph similar in size to the IEEE dataset.
Note: In the above command, the static_bipartite_graph synthesizer is used, along with ctgan as the edge generator and the xg_boost aligner for aligning the generated features with the graph structure. Accordingly, --data-path points to the location where the IEEE dataset is extracted, and --save-path points to the location where the generated data is saved.
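The escaped JSON passed to --features-to-correlate-edge is easy to get wrong by hand. As a minimal sketch, the same command can be assembled in Python so that `json.dumps` produces the dictionary argument and `shlex.join` handles shell quoting. Only flags and values from the example above are used; the script builds and prints the command and does not run it.

```python
import json
import shlex

# Column name and type taken from the example command above.
features_to_correlate_edge = {"TransactionAmt": "continuous"}

command = [
    "syngen", "synthesize",
    "--synthesizer", "static_bipartite_graph",
    "--preprocessing", "ieee",
    "--aligner", "xg_boost",
    "--graph-generator", "rmat_bipartite",
    "--gg-seed", "42",
    "--edge-generator", "ctgan",
    "--eg-batch-size", "2000",
    "--eg-epochs", "10",
    "--num-nodes-src-set", "17091",
    "--num-nodes-dst-set", "198",
    "--num-edges-src-dst", "52008",
    "--num-edges-dst-src", "52008",
    "--data-path", "/workspace/data/ieee-fraud/data.csv",
    "--save-path", "/workspace/ieee/",
    "--features-to-correlate-edge", json.dumps(features_to_correlate_edge),
]

# Print a copy-pasteable, correctly quoted shell command.
print(shlex.join(command))
```

Inside the interactive container, the same list could be passed to `subprocess.run(command, check=True)` instead of being printed.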
Following the above command, the pretrain command can be used to pre-train or fine-tune on the generated sample.
syngen pretrain \
--model gat_ec \
--hidden-dim 64 \
--out-dim 32 \
--n-layers 1 \
--n-heads 2 \
--weight-decay 0.0 \
--learning-rate 0.0005 \
--batch-size 256 \
--pretrain-epochs 5 \
--finetune-epochs 5 \
--data-path '/workspace/data/ieee-fraud/data.csv' \
--pretraining-path '/workspace/ieee/' \
--preprocessing ieee \
--task ec \
--target-col isFraud \
--num-classes 2
Note: The current set of tasks and models is provided solely as a use case example of how to use the generated synthetic data to pretrain/fine-tune on a downstream task, and would generally need extensions or modifications to accommodate very large graphs or arbitrary models.
For the complete CLI usage of the synthesize command, run:
syngen synthesize --help
Similarly, for the pretrain command, run:
syngen pretrain --help
.
├── demos # Directory with all the Jupyter examples
├── docker_scripts # Directory with Docker scripts
├── scripts # Directory with datasets scripts
├── syngen # Directory with Synthetic Graph Generation source code
│ ├── analyzer # Directory with tools for getting graph visualisation and statistics
│ │ ├── graph # Directory with graph structure analyzer
│ │ └── tabular # Directory with tabular features analyzer
│ ├── benchmark # Directory with pretraining tools
│ │ ├── data_loader # Directory with pre-defined node and edge classification datasets
│ │ ├── models # Directory with GNN model definitions
│ │ └── tasks # Directory with set of tasks that are supported for training
│ ├── generator # Directory with all the generators
│ │ ├── graph # Directory with graph generators and graph
│ │ ├── tabular # Directory with tabular generators
│ │ │ ├── data_transformer # Directory with tabular data transformations used by generators
│ │ │ └── transforms # Directory with tabular column transforms
│ ├── graph_aligner # Directory with all the aligners
│ ├── preprocessing # Directory with the preprocessings for the supported datasets
│ │ └── datasets # Directory with example dataset preprocessing scripts used to generate data
│ ├── synthesizer # Directory with all the synthesizers
│ └── utils # Directory with the utilities
│ └── types # Directory with common data types used in the tool
- scripts/get_datasets.sh - Bash script that downloads and preprocesses the supported datasets
- docker_scripts/build_docker.sh - Bash script that builds the Docker image
- docker_scripts/run_docker_notebook.sh - Bash script that runs Jupyter notebook in the Docker container
- docker_scripts/run_docker_interactive.sh - Bash script that runs the Docker container in interactive mode
- syngen/__main__.py - Python script that defines the Synthetic Graph Generation CLI
- syngen/synthesizer/static_graph_synthesizer.py - Python file with the non-partite graph synthesizer
- syngen/synthesizer/static_bipartite_graph_synthesizer.py - Python file with the bipartite graph synthesizer
For the synthesizer, refer to the parameters in the following table.
Scope | parameter | Comment | Default Value |
---|---|---|---|
synthesize | -s, --synthesizer SYNTHESIZER | Synthesizer to use. Available synthesizers: ['static_bipartite_graph', 'static_graph', 'random'] | Required |
synthesize | -dp, --data-path DATA_PATH | Path to dataset | None |
synthesize | -pp, --preprocessing PREPROCESSING | Preprocessing object to use. Custom preprocessing can be added to datasets. Available datasets: ['cora', 'paysim', 'credit', 'tabformer', 'ieee', 'ratings'] | None |
synthesize | -sp, --save-path SAVE_PATH | Save path to dump generated files | Current directory |
synthesize | -a, --aligner ALIGNER | Aligner to use. Available aligners: ['random', 'xg_boost'] | None |
synthesize | -gg, --graph-generator GRAPH_GENERATOR | Graph generator to use to generate graph structure ['rmat', 'rmat_bipartite', 'random_graph', 'random_bipartite'] | None |
synthesize | -eg, --edge-generator EDGE_GENERATOR | Edge generator to use to generate edge features ['kde', 'kde_sk', 'uniform', 'gaussian', 'ctgan'] | None |
synthesize | -ng, --node-generator NODE_GENERATOR | Node generator to use to generate node features ['kde', 'kde_sk', 'uniform', 'gaussian', 'ctgan'] | None |
synthesize | --num-workers NUM_WORKER | Number of workers | 1 |
synthesize | --num-nodes-src-set NUM_NODES_SRC_SE | Number of nodes to generate in the source set. Applies to StaticBipartiteGraphSynthesizer. | None |
synthesize | --num-nodes-dst-set NUM_NODES_DST_SE | Number of nodes to generate in the destination set. Applies to StaticBipartiteGraphSynthesizer. | None |
synthesize | --num-edges-src-dst NUM_EDGES_SRC_DS | Number of edges to generate from the source set to the destination set. Applies to StaticBipartiteGraphSynthesizer. | None |
synthesize | --num-edges-dst-src NUM_EDGES_DST_SR | Number of edges to generate from the destination set to the source set. Applies to StaticBipartiteGraphSynthesizer. | None |
synthesize | --num-nodes NUM_NODE | Number of nodes to generate for non-partite synthesizer. Applies to StaticGraphSynthesizer. | None |
synthesize | --num-edges NUM_EDGE | Number of edges to generate for non-partite synthesizer. Applies to StaticGraphSynthesizer. | None |
synthesize | --edge-dim EDGE_DIM | Edge feature dimension to generate. Applies to RandomSynthesizer. | None |
synthesize | --g-bipartite G_BIPARTITE | Generates random bipartite graph. Applies to RandomSynthesizer. | None |
synthesize | --node-dim NODE_DIM | Node feature dimension to generate. Applies to RandomSynthesizer. | None |
synthesize | --features-to-correlate-node FEATURES_TO_CORRELATE_NODES | Node feature columns to use to train XGBoostAligner. Must be provided as a dict that maps each column name to its column type, for example {"TransactionAmt": "continuous"}, where the column type is a value of the ColumnType enum (refer to syngen/utils/types/column_type.py). | None |
synthesize | --features-to-correlate-edge FEATURES_TO_CORRELATE_EDGES | Edge feature columns to use to train XGBoostAligner. Must be provided as a dict that maps each column name to its column type, for example {"TransactionAmt": "continuous"}, where the column type is a value of the ColumnType enum (refer to syngen/utils/types/column_type.py). | None |
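Several parameters in the table apply to one synthesizer only. The sketch below encodes that part of the table as a lookup and reports flags that belong to a different synthesizer. The helper `misplaced_flags` is hypothetical, and the pairing of the CLI names 'static_graph' and 'random' with StaticGraphSynthesizer and RandomSynthesizer is an assumption based on naming. How the actual CLI treats such flags is not specified here.

```python
# Synthesizer-specific flags, transcribed from the parameter table.
SPECIFIC_FLAGS = {
    "static_bipartite_graph": [
        "--num-nodes-src-set", "--num-nodes-dst-set",
        "--num-edges-src-dst", "--num-edges-dst-src",
    ],
    "static_graph": ["--num-nodes", "--num-edges"],
    "random": ["--edge-dim", "--g-bipartite", "--node-dim"],
}


def misplaced_flags(synthesizer, used_flags):
    """Return the flags that the table assigns to a different synthesizer."""
    allowed = set(SPECIFIC_FLAGS[synthesizer])
    all_specific = {flag for flags in SPECIFIC_FLAGS.values() for flag in flags}
    return sorted(
        flag for flag in used_flags
        if flag in all_specific and flag not in allowed
    )


# --aligner is shared by all synthesizers, so only the bipartite flag is reported.
print(misplaced_flags("static_graph",
                      ["--num-nodes", "--num-nodes-src-set", "--aligner"]))
```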
For pretraining, refer to the Command-line options section, as the parameters depend on the model choice.
To display the full list of available options and their descriptions, use the -h or --help command-line option:
syngen --help
The tool currently supports the synthesize and pretrain commands. To display the full list of available options for the respective command, run:
syngen <command> --help
In this example, we show how to define the synthesizer pipeline for the IEEE dataset. A full example can be found in ieee_notebook.
- Feature generator is used to generate tabular features. For the final graph, we use only edge features as there are no node features in the IEEE dataset. In this example, we use the CTGAN generator.
edge_feature_generator = CTGANGenerator(epochs=10, batch_size=2000, verbose=False)
- Structure generator is used to generate graph structure. In this example, we use the RMAT implementation.
static_graph_generator = RMATBipartiteGenerator()
- Preprocessing is required for all datasets to work. For custom datasets, users need to create their own preprocessing following the base API. In the repository, we provide implementations for all the supported datasets.
preprocessing = IEEEPreprocessing(cached=False)
- Aligner is necessary to properly align generated structure with tabular features, as those two processes are independent of each other. In this example, we use an XGBoost aligner and specify which features to correlate to which structure.
graph_aligner = XGBoostAligner(features_to_correlate_edge={'TransactionAmt': ColumnType.CONTINUOUS})
- Synthesizer is a class that combines all the generators and allows the user to run end-to-end fitting and generation. We use StaticBipartiteGraphSynthesizer because ieee is a bipartite dataset. To select whether to run the computation on GPU, use the gpu flag. By setting is_directed=False, we specify that the graph is undirected.
synthesizer = StaticBipartiteGraphSynthesizer(
graph_generator=static_graph_generator,
graph_info=preprocessing.graph_info,
edge_feature_generator=edge_feature_generator,
graph_aligner=graph_aligner,
is_directed=False,
gpu=True)
- To generate a graph, we first need to extract the graph data (structure and features) from the preprocessing. This can be done by calling the provided .transform method on the dataset path.
data = preprocessing.transform(dataset_path)
- To run fitting for all the generators, we use the fit method provided by the synthesizer. We pass only edge_data as ieee is a bipartite dataset with edge features only.
synthesizer.fit(edge_data=data[MetaData.EDGE_DATA])
- To run generation, we call the generate method provided by the synthesizer. We can provide the number of nodes in both partites and the number of edges for each direction. In our case, the number of edges is the same in both directions because we specified an undirected graph during synthesizer instantiation.
data_proper = synthesizer.generate(num_nodes_src_set,
num_nodes_dst_set,
num_edges_src_dst,
num_edges_dst_src)
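To illustrate why an aligner is needed when structure and features are generated independently, the sketch below pairs edges with feature values by rank. This is a deliberately simplified stand-in and not the XGBoost aligner: it assumes, purely for illustration, that edges leaving high-degree source nodes should receive larger values. The function name `align_by_rank` and the toy data are hypothetical.

```python
from collections import Counter


def align_by_rank(edges, amounts):
    """Pair edges with values so that edges from higher-degree sources
    receive larger values (rank matching on a structural score)."""
    if len(edges) != len(amounts):
        raise ValueError("need exactly one feature value per edge")
    out_degree = Counter(src for src, _ in edges)
    ranked_edges = sorted(edges, key=lambda edge: out_degree[edge[0]])
    ranked_amounts = sorted(amounts)
    return [(src, dst, amount)
            for (src, dst), amount in zip(ranked_edges, ranked_amounts)]


# Generated structure: source node 0 has three edges, nodes 1 and 2 have one each.
edges = [(0, 10), (0, 11), (0, 12), (1, 10), (2, 11)]
# Independently generated feature values, in arbitrary order.
amounts = [50.0, 5.0, 20.0, 1.0, 30.0]

aligned = align_by_rank(edges, amounts)
for row in aligned:
    print(row)
```

A learned aligner replaces the fixed ranking rule with a model fitted on the original data, so the relationship between structure and the selected feature columns is estimated rather than assumed.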
To download the datasets used as examples, use the get_datasets.sh script:
bash scripts/get_datasets.sh
Note: Certain datasets require a Kaggle API key, hence may require manual download. Refer to the links below.
Note: Each user is responsible for checking the content of datasets and the applicable licenses and determining if they are suitable for the intended use.
Supported datasets:
Our results were obtained by running the notebooks from the demos directory in the PyTorch NGC container on NVIDIA DGX1 V100 with 8x V100 32GB GPUs. All the notebooks are presented in the table below.
No. | scope | notebook | description |
---|---|---|---|
1. | basic_examples | er_demo.ipynb | generating different types of random graphs using Erdős–Rényi model |
2. | basic_examples | ieee_demo.ipynb | generating a bipartite graph structure based on provided edge list |
3. | basic_examples | cora_demo.ipynb | generating a non-bipartite graph structure based on provided edge list |
4. | advanced_examples | e2e_cora_demo.ipynb | a complete process of reconstructing and analyzing a non-bipartite graph dataset with node features |
5. | advanced_examples | e2e_ieee_demo.ipynb | a complete process of reconstructing and analyzing a bipartite graph dataset with edge features |
6. | advanced_examples | frechet_lastfm_demo.ipynb | SynGen non-bipartite graph structure generators scaling analysis |
7. | advanced_examples | frechet_tabformer_demo.ipynb | SynGen bipartite graph structure generators scaling analysis |
8. | advanced_examples | edge_classification_pretraining.ipynb | using synthetic data from SynGen for GNN pretraining |
9. | performance | struct_generator.ipynb | comparison of SynGen graph structure generators |
10. | performance | tabular_generator.ipynb | comparison of SynGen tabular data generators |
Scope refers to the directories in which the notebooks are stored and the functionalities that particular notebooks cover. There are three scopes:
- Basic - basic_examples - notebooks with examples of basic functionalities
- Advanced - advanced_examples - notebooks with the examples of advanced functionalities
- Performance - performance - notebooks with the performance experiments
To achieve the same results, follow the steps in the Quick Start Guide.
The quality comparison of the generated content was conducted on the IEEE dataset (refer to List of datasets for more details) with the corresponding notebook e2e_ieee_demo.ipynb. We compared three modalities: the quality of the generated graph structure, the quality of the generated tabular data, and the quality of aligning the tabular data to the graph structure.
- Graph structure quality
- Tabular data quality
- Structure to tabular alignment quality
- Performance of categorical tabular data generation (samples/s)
Dataset | KDE | KDE_SK | Uniform | Gaussian | CTGAN |
---|---|---|---|---|---|
ieee | 1420600 | 2640973 | 1768136 | 1509729 | 33202 |
- Performance (batches/s) comparison between original vs. synthetic datasets
Dataset | Model | Synthetic | Original |
---|---|---|---|
ieee | gat | 0.07173 | 0.07249 |
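The reported throughput figures can be put side by side with a few lines of arithmetic. The sketch below only restates the numbers from the two performance results above and derives relative factors from them; the variable names are chosen for this example.

```python
# Tabular generation throughput on ieee, in samples per second.
samples_per_second = {
    "KDE": 1420600,
    "KDE_SK": 2640973,
    "Uniform": 1768136,
    "Gaussian": 1509729,
    "CTGAN": 33202,
}

fastest = max(samples_per_second, key=samples_per_second.get)
# How many times slower each generator is than the fastest one.
slowdown_vs_fastest = {
    name: samples_per_second[fastest] / rate
    for name, rate in samples_per_second.items()
}

# Training throughput on ieee with the gat model, in batches per second.
synthetic_bps, original_bps = 0.07173, 0.07249
relative_gap = (original_bps - synthetic_bps) / original_bps

print(fastest)
for name, factor in slowdown_vs_fastest.items():
    print(f"{name}: {factor:.2f}x slower than {fastest}")
print(f"synthetic vs. original throughput gap: {relative_gap:.2%}")
```

Read this way, the GAN-based generator is roughly 80 times slower than the fastest generator, while training throughput on synthetic and original data differs by about one percent.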
January 2023
- Initial release
There are no known issues with this model.
Cite the following paper if you find this code useful or use it in your own work:
@article{darabi2022framework,
title={A Framework for Large Scale Synthetic Graph Dataset Generation},
author={Darabi, Sajad and Bigaj, Piotr and Majchrowski, Dawid and Morkisz, Pawel and Fit-Florea, Alex},
journal={arXiv preprint arXiv:2210.01944},
year={2022}
}