CS-TAG

CS-TAG is a project for sharing public text-attributed graph (TAG) datasets and benchmarking the performance of different baseline methods. We welcome contributions of additional datasets that are valuable for TAG research.

Datasets 🔔

We collect and construct 8 TAG datasets from ogbn-arxiv, Amazon, DBLP, and Goodreads. You can now go to 'Files and versions' in CSTAG to find the datasets we have uploaded! Each dataset folder contains a CSV file (which stores the text attributes of the dataset), a PT file (the DGL graph), and a Feature folder (which stores the text embeddings we extract from the PLM). You can use the initial node features we provide, or extract the node features yourself with our code. For a more detailed and clearer walkthrough, please click here. 😎
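As a quick sanity check after downloading, the sketch below loads the graph, the raw text, and the precomputed features for the Photo dataset. The paths follow the commands later in this README; the CSV file name is an assumption, and the loading call assumes the PT file is a DGL binary, so adjust both to your local layout.

import dgl
import numpy as np
import pandas as pd

graph_path = "data/CSTAG/Photo/Photo.pt"   # DGL graph (path as used in the commands below)
text_path = "data/CSTAG/Photo/Photo.csv"   # hypothetical CSV file name; check the dataset folder
feat_path = "data/CSTAG/Photo/Feature/Photo_roberta_base_512_cls.npy"

# Load the DGL graph (assumes the .pt file was saved with dgl.save_graphs).
graphs, _ = dgl.load_graphs(graph_path)
g = graphs[0]

# Load the raw text attributes and the PLM embeddings (one row per node).
text_df = pd.read_csv(text_path)
features = np.load(feat_path)
print(g, text_df.shape, features.shape)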

Environments

You can quickly install the required dependencies with:

conda env create -f environment.yml

Pipeline 🎮

We describe below how to use our repository to reproduce the experiments reported in the paper. We are also adjusting the structure of the repository to make it easier to use. (Please complete the 'Datasets' section and feature preparation above first.)

1. GNN for Node Classification/Link Prediction

You can use 'ogbn-arxiv', 'Children', 'History', 'Fitness', 'Photo', 'Computers', 'webkb-cornell', 'webkb-texas', 'webkb-washington', or 'webkb-wisconsin' as the '--data_name'.

python GNN/GNN.py --data_name=Photo --dropout=0.2 --lr=0.005 --model_name=SAGE --n-epochs=1000 --n-hidden=256 --n-layers=3 --n-runs=5 --use_PLM=data/CSTAG/Photo/Feature/Photo_roberta_base_512_cls.npy
python GNN/GNN_Link.py --use_PLM=data/CSTAG/Photo/Feature/Photo_roberta_base_512_cls.npy --path=data/CSTAG/Photo/LinkPrediction/ --graph_path=data/CSTAG/Photo/Photo.pt --gnn_model=GCN

2. PLM for Classification Tasks

CUDA_VISIBLE_DEVICES=0,1 /usr/bin/env python sweep/dist_runner.py LMs/trainLM.py --att_dropout=0.1 --cla_dropout=0.1 --dataset=Computers_RS --dropout=0.1 --epochs=4 --eq_batch_size=180 --eval_patience=20000 --grad_steps=1 --label_smoothing_factor=0.1 --lr=4e-05 --model=Deberta --per_device_bsz=60 --per_eval_bsz=1000 --train_ratio=0.2 --val_ratio=0.1 --warmup_epochs=1 --gpus=0,1 --wandb_name OFF --wandb_id OFF 

3. TMLM for PreTraining

To be updated; this part is currently under debugging.

4. TDK for PreTraining

To be updated; this part is currently under debugging.

5. TCL for PreTraining

CUDA_VISIBLE_DEVICES=0,1 /usr/bin/env python sweep/dist_runner.py LMs/Train_Command/train_CL.py --PrtMode=TCL --att_dropout=0.1 --cla_dropout=0.1 --dataset=Photo_RS --dropout=0.1 --epochs=5 --eq_batch_size=60 --per_device_bsz=15 --grad_steps=2 --lr=5e-05 --model=Bert --warmup_epochs=1 --gpus=0,1 --cache_dir=exp/TCL/Photo/Bert_base/

6. TMDC for Training

To be updated; this part is currently under debugging.

Create Your Model

If you want to add your own model to this code base, you can follow the steps below:

Add your GNN model:

  1. In GNN/model/GNN_library, define your model (you can refer to the code for models such as GCN, GAT, etc.).
  2. In the args_init() function in GNN/model/GNN_arg.py, check whether it contains all the parameters your model needs. If anything is missing, you can easily add new parameters to this function.
  3. Import the model you defined in GNN/GNN.py and add it to the gen_model() function (a rough sketch follows this list). You can then run the corresponding code to perform the node classification task.
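As a rough illustration only (the exact gen_model() signature in GNN/GNN.py may differ, and the model name below is hypothetical), a new DGL-based model and its registration might look like this:

import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GraphConv

# Hypothetical model defined in GNN/model/GNN_library.
class MyGNN(nn.Module):
    def __init__(self, in_feats, n_hidden, n_classes, dropout=0.2):
        super().__init__()
        self.conv1 = GraphConv(in_feats, n_hidden)
        self.conv2 = GraphConv(n_hidden, n_classes)
        self.dropout = nn.Dropout(dropout)

    def forward(self, graph, feat):
        h = F.relu(self.conv1(graph, feat))
        h = self.dropout(h)
        return self.conv2(graph, h)

# Inside gen_model() in GNN/GNN.py (sketched signature, not the exact one in the repository):
def gen_model(args, in_feats, n_classes):
    if args.model_name == "MyGNN":
        return MyGNN(in_feats, args.n_hidden, n_classes, dropout=args.dropout)
    ...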

Add your PLM model:

  1. Go to the LM/Model/ path and create a folder named after your model. Define init.py and config.py in it (see how these two files are defined in the other folders).
  2. Add the parameters you need to the parser() function in lm_utils.
  3. If your model cannot be loaded from Hugging Face, pass the path of the folder your model corresponds to via the 'pretrain_path' parameter (a rough sketch of this behaviour follows this list).
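The contents of init.py and config.py are model-specific, so follow the existing folders; as a sketch of the 'pretrain_path' behaviour only (the helper name and the argument object are hypothetical, not part of the repository), loading could look like this:

from transformers import AutoModel, AutoTokenizer

def load_plm(args):
    # Hypothetical helper: prefer a local checkpoint given via 'pretrain_path',
    # otherwise fall back to a Hugging Face Hub model name.
    name_or_path = args.pretrain_path if getattr(args, "pretrain_path", None) else args.model
    tokenizer = AutoTokenizer.from_pretrained(name_or_path)
    model = AutoModel.from_pretrained(name_or_path)
    return tokenizer, model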

Main experiments in CS-TAG

Representation learning on TAGs typically depends on two types of models: Graph Neural Networks (GNNs) and Language Models. For the latter, we use Pretrained Language Models (PLMs) to encode the text. For the GNNs, we follow the DGL toolkit and implement them in the GNN library. For the PLMs, we follow the Hugging Face Trainer to implement them in a unified pipeline. We acknowledge that a fully fair comparison between these two types of baselines is not possible.

Citation

If you use our datasets, please consider citing our work:

@article{yan2023comprehensive,
  title={A Comprehensive Study on Text-attributed Graphs: Benchmarking and Rethinking},
  author={Yan, Hao and Li, Chaozhuo and Long, Ruosong and Yan, Chao and Zhao, Jianan and Zhuang, Wenwen and Yin, Jun and Zhang, Peiyan and Han, Weihao and Sun, Hao and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  pages={17238--17264},
  year={2023}
}
