This is an implementation of Planetoid, a graph-based semi-supervised learning method proposed in the following paper:
Revisiting Semi-Supervised Learning with Graph Embeddings. Zhilin Yang, William W. Cohen, Ruslan Salakhutdinov. ICML 2016.
We include the Citeseer dataset in the directory data, where the data structures needed are pickled.
To run the transductive version,
python test_trans.py
To run the inductive version,
python test_ind.py
You can refer to test_trans.py and test_ind.py for example usages of our model.
The models are implemented mainly in trans_model.py (transductive) and ind_model.py (inductive), both of which inherit from base_model.py. You can refer to the source files for detailed API documentation.
The input to the transductive model contains:
x, the feature vectors of the training instances,
y, the one-hot labels of the training instances,
graph, a dict in the format {index: [index_of_neighbor_nodes]}, where the neighbor nodes are organized as a list.
The current version only supports binary graphs.
Let L be the number of training instances. The indices in graph from 0 to L - 1 must correspond to the training instances, in the same order as in x.
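As an illustration of the layout described above, here is a minimal sketch of toy transductive inputs. It assumes dense NumPy arrays for x and y and reads "binary graph" as an unweighted graph; the helper check_transductive_inputs is hypothetical and not part of this repository.

```python
import numpy as np

# Three training instances with four features each, so L = 3.
x = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 1]], dtype=np.float32)

# One-hot labels over two classes, one row per training instance.
y = np.array([[1, 0],
              [0, 1],
              [1, 0]], dtype=np.int32)

# Nodes 0..2 are the training instances, in the same order as the rows of x.
# Nodes 3 and 4 are additional nodes that only appear in the graph.
graph = {0: [1, 3], 1: [0, 2], 2: [1, 4], 3: [0], 4: [2]}


def check_transductive_inputs(x, y, graph):
    """Hypothetical sanity check; returns L, the number of training instances."""
    num_train = x.shape[0]
    if y.shape[0] != num_train:
        raise ValueError("x and y need one row per training instance")
    if not np.all(y.sum(axis=1) == 1):
        raise ValueError("each row of y should be one-hot")
    for node, neighbors in graph.items():
        if not isinstance(neighbors, list):
            raise ValueError("neighbors of node %d should be a list" % node)
    return num_train


print(check_transductive_inputs(x, y, graph))
```

The check only covers the constraints stated above; test_trans.py remains the reference for how the inputs are actually passed to the model.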
The input to the inductive model contains:
x, the feature vectors of the labeled training instances,
y, the one-hot labels of the labeled training instances,
allx, the feature vectors of both labeled and unlabeled training instances (a superset of x),
graph, a dict in the format {index: [index_of_neighbor_nodes]}.
Let n be the total number of labeled and unlabeled training instances. These n instances should be indexed from 0 to n - 1 in graph, in the same order as in allx.
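The inductive layout can be sketched the same way. The toy example below assumes dense NumPy arrays and, purely for illustration, places the labeled instances in the first rows of allx; the text above only requires allx to be a superset of x.

```python
import numpy as np

# Two labeled training instances with two features each.
x = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y = np.array([[1, 0],
              [0, 1]])

# Three unlabeled training instances (toy values).
x_unlabeled = np.array([[1.0, 1.0],
                        [0.0, 0.0],
                        [0.5, 0.5]])

# allx holds labeled and unlabeled instances together.
# Assumption: the labeled rows come first.
allx = np.vstack([x, x_unlabeled])
n = allx.shape[0]

# Node i in graph corresponds to row i of allx, for i in 0 .. n - 1.
graph = {0: [2], 1: [3], 2: [0, 4], 3: [1], 4: [2]}

print(n, sorted(graph))
```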
Datasets for Citeseer, Cora, and Pubmed are available in the directory data, in a preprocessed format stored as numpy/scipy files.
The dataset for DIEL is available at http://www.cs.cmu.edu/~lbing/data/emnlp-15-diel/emnlp-15-diel.tar.gz. We also provide a much more succinct version of the dataset, containing only the necessary files and some (not very well-organized) pre-processing code, at http://cs.cmu.edu/~zhiliny/data/diel_data.tar.gz.
The NELL dataset can be found at http://www.cs.cmu.edu/~zhiliny/data/nell_data.tar.gz.
In addition to x, y, allx, and graph as described above, the preprocessed datasets also include:
tx, the feature vectors of the test instances,
ty, the one-hot labels of the test instances.
Refer to test_ind.py and test_trans.py for the definition of different hyper-parameters (passed as arguments). Hyper-parameters are tuned by randomly shuffling the training/test split (i.e., randomly shuffling the indices in x, y, tx, ty, and graph). For the DIEL dataset, we tune the hyper-parameters on one of the ten runs, and then keep the same hyper-parameters for all ten runs.
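One way to shuffle a split while keeping the graph consistent is sketched below. This is not the script used for the reported results. It assumes that the features and labels of all instances are stacked in two arrays whose row i corresponds to node i in graph, and that the first num_train rows after shuffling form the training set. The function name shuffle_split and its parameters are hypothetical.

```python
import numpy as np


def shuffle_split(features, labels, graph, num_train, seed=0):
    """Randomly re-split the instances and relabel graph to match (sketch)."""
    rng = np.random.RandomState(seed)
    n = features.shape[0]
    order = rng.permutation(n)        # order[new_index] = old_index
    new_index = np.empty(n, dtype=int)
    new_index[order] = np.arange(n)   # new_index[old_index] = new_index

    # Reorder the rows, then rename every node in the graph the same way.
    features, labels = features[order], labels[order]
    new_graph = {
        int(new_index[i]): [int(new_index[j]) for j in neighbors]
        for i, neighbors in graph.items()
    }

    x, y = features[:num_train], labels[:num_train]
    tx, ty = features[num_train:], labels[num_train:]
    return x, y, tx, ty, new_graph


# Toy data: the single feature of each instance is its original node id.
features = np.arange(5).reshape(5, 1)
labels = np.eye(2)[[0, 1, 0, 1, 0]]
graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}

x, y, tx, ty, new_graph = shuffle_split(features, labels, graph, num_train=3, seed=1)
print(x.shape, tx.shape, sorted(new_graph))
```

Because the same permutation is applied to the rows and to the node ids, every edge of the original graph is preserved under the new numbering.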