Cleanup arguments and readme
cthoyt committed Feb 22, 2022
1 parent 9106abf commit 5f2c7d2
Showing 3 changed files with 106 additions and 39 deletions.
77 changes: 50 additions & 27 deletions README.md
@@ -10,27 +10,32 @@ the [KG Course](https://github.com/migalkin/kgcourse2021).
## Datasets

<img alt="A schematic diagram of inductive link prediction"
src="https://pykeen.readthedocs.io/en/latest/_images/ilp_1.png"
height="200" align="right"
src="https://pykeen.readthedocs.io/en/latest/_images/ilp_1.png"
height="200" align="right"
/>

While in *transductive* link prediction, the training and inference graph are
the same (and therefore contain the same entities), in *inductive* link
prediction, there is a disjoint inference graph that potentially contains new,
unseen entities.

TODO: give background on the dataset - where does it come from and how was it
constructed

Both the small and large variants of the dataset can be found in the
[`data`](data) folder of this repository. Each contains four splits
corresponding to the diagram:

* `train.txt` - the training graph on which you are supposed to train a model
* `inference.txt` - the inductive inference graph, **disjoint** from the
  training one - that is, it has a new, non-overlapping set of entities; the
  missing links are sampled from this graph
* `inductive_validation.txt` - validation set of triples to predict, uses
  entities from the **inference** graph
* `inductive_test.txt` - test set of triples to predict, uses entities from
  the **inference** graph
* a hold-out test set of triples - kept by the organizers for the final
  ranking 😉, uses entities from the **inference** graph

### [GALK78k](data/small)

@@ -56,7 +61,8 @@

TODO: explain the challenge philosophy, how to participate, etc.

We use the following
[rank-based evaluation metrics](https://pykeen.readthedocs.io/en/stable/tutorial/understanding_evaluation.html):

* MRR (Inverse Harmonic Mean Rank) - higher is better
* Hits @ K (H@K; with K as one of `{1, 3, 5, 10, 100}`) - higher is better
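
For reference, a tiny self-contained sketch (not part of this repo) of how
these rank-based metrics fall out of a list of ranks:

```python
from typing import Sequence


def mrr(ranks: Sequence[int]) -> float:
    """Inverse harmonic mean rank: the mean of 1/rank; higher is better."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of ranks that are at most k; higher is better."""
    return sum(rank <= k for rank in ranks) / len(ranks)


ranks = [1, 3, 7, 40, 120]
print(f"MRR: {mrr(ranks):.3f}")         # 0.302
print(f"H@10: {hits_at_k(ranks, 10)}")  # 0.6
```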
@@ -68,17 +74,22 @@

Making a submission:
1. Fork the repo
2. Train your inductive link prediction model
3. Save the model weights using the `--save` flag
4. Upload the model weights to GitHub or another platform (Dropbox, Google
   Drive, etc.)
5. Open an issue in **this** repo with the link to your repo, performance
   metrics, and model weights
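
As a concrete example of steps 2-3, any of the training commands below can be
combined with `--save` (flags as defined in [`main.py`](main.py)):

```shell
$ python main.py --dataset small -d 32 -e 50 -n 16 -m 2.0 -lr 0.0001 --gnn --save
```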

## Baselines

We provide an example workflow in [`main.py`](main.py) for training and
evaluating two variants of the [NodePiece](https://arxiv.org/abs/2106.12144)
model using [PyKEEN](https://github.com/pykeen/pykeen):

* `InductiveNodePiece` - plain tokenizer + token MLP encoder to bootstrap node
  representations. Fast.
* `InductiveNodePieceGNN` - everything above + an additional 2-layer
  [CompGCN](https://arxiv.org/abs/1911.03082) message-passing encoder. Slower,
  but performs better.
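
For orientation, here is a minimal sketch of instantiating the plain variant
directly. The parameter names follow our reading of PyKEEN's inductive
NodePiece API, so treat the exact signature as an assumption and consult
`pykeen.models.InductiveNodePiece` before relying on it:

```python
from pykeen.models import InductiveNodePiece
from pykeen.triples import TriplesFactory

# NodePiece's relational tokenization requires inverse triples
train = TriplesFactory.from_path(
    "data/small/train.txt", create_inverse_triples=True
)
inference = TriplesFactory.from_path(
    "data/small/inference.txt",
    relation_to_id=train.relation_to_id,
    create_inverse_triples=True,
)

model = InductiveNodePiece(
    triples_factory=train,        # training graph
    inference_factory=inference,  # disjoint inference graph
    num_tokens=5,                 # tokens per node, as in the baselines
    embedding_dim=32,             # as in the reported runs
)
```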

The example can be run with `python main.py` and the options can be listed
with `python main.py --help`.
@@ -98,21 +109,25 @@ For more information on the models check out the [PyKEEN tutorial](https://pykee
<summary>Installation Instructions</summary>

Main requirements:

* python >= 3.9
* torch >= 1.10

You will need PyKEEN 1.8.0 or newer.

```shell
$ pip install pykeen
```

At the time of creation of this repo, 1.8.0 is not yet released, but the
latest version from sources contains everything we need:

```shell
$ pip install git+https://github.com/pykeen/pykeen.git
```

If you plan to use GNNs (including the `InductiveNodePieceGNN` baseline), make
sure you install [torch-scatter](https://github.com/rusty1s/pytorch_scatter)
and [torch-geometric](https://github.com/pyg-team/pytorch_geometric) versions
compatible with your Python, torch, and CUDA versions.
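
For example, with torch 1.10 and CUDA 11.3 the prebuilt wheels can be used.
The index URL below follows PyG's documented wheel-index pattern; adjust the
torch and CUDA tags to your own setup:

```shell
$ pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
$ pip install torch-geometric
```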

@@ -125,13 +140,16 @@

Running the code on a GPU is strongly recommended.
We report the performance of both variants of the NodePiece model on the small
variant of the dataset after running the following:

* InductiveNodePieceGNN (32d, 50 epochs, 24K params) - NodePiece (5 tokens per
node, MLP aggregator) + 2-layer CompGCN with DistMult composition function +
DistMult decoder. Training time: **77 min***
```shell
$ python main.py --dataset small -d 32 -e 50 -n 16 -m 2.0 -lr 0.0001 --gnn
```
* InductiveNodePiece (32d, 50 epochs, 15.5K params) - NodePiece (5 tokens per
node, MLP aggregator) + DistMult decoder. Training time: **6 min***
```shell
$ python main.py --dataset small -d 32 -e 50 -n 16 -m 5.0 -lr 0.0001
```

| **Model** | MRR | H@100 | H@10 | H@5 | H@3 | H@1 | MR | AMR |
@@ -144,18 +162,23 @@
We report the performance of both variants of the NodePiece model on the large
variant of the dataset after running the following:

* InductiveNodePieceGNN (32d, 53 epochs, 24K params) - NodePiece (5 tokens per
node, MLP aggregator) + 2-layer CompGCN with DistMult composition function +
DistMult decoder. Training time: **8 hours***
```shell
$ python main.py --dataset large -d 32 -e 53 -n 16 -m 20.0 -lr 0.0001 --gnn
```
* InductiveNodePiece (32d, 17 epochs, 15.5K params) - NodePiece (5 tokens per
node, MLP aggregator) + DistMult decoder. Training time: **5 min***
```shell
$ python main.py --dataset large -d 32 -e 17 -n 16 -m 15.0 -lr 0.0001
```

| **Model** | MRR | H@100 | H@10 | H@5 | H@3 | H@1 | MR | AMR |
|-----------------------|-------:|----------:|-----------:|-----------:|-----------:|-------:|---------:|----------:|
| InductiveNodePieceGNN | 0.0705 | **0.374** | **0.1458** | **0.0990** | **0.0730** | 0.0319 | **4566** | **0.318** |
| InductiveNodePiece | 0.0651 | 0.287 | 0.1246 | 0.0809 | 0.0542 | 0.0373 | 5078 | 0.354 |

\* Note: All models were trained on a single RTX 8000. Average memory
consumption during training is about 2 GB VRAM on the `small` dataset and about
3 GB on `large`.
66 changes: 55 additions & 11 deletions main.py
@@ -28,23 +28,59 @@


@click.command()
@click.option("--dataset", type=click.Choice(["small", "large"]), default="small", show_default=True)
@click.option(
"--embedding-dim", type=int, default=100, show_default=True, help="The dimension of the entity embeddings"
"--dataset",
type=click.Choice(["small", "large"]),
default="small",
show_default=True,
)
@click.option(
"-d",
"--embedding-dim",
type=int,
default=100,
show_default=True,
help="The dimension of the entity embeddings",
)
@click.option(
"-t",
"--tokens",
type=int,
default=5,
show_default=True,
help="Number of tokens to use in NodePiece",
)
@click.option("--tokens", type=int, default=5, show_default=True, help="Number of tokens to use in NodePiece")
@click.option("-lr", "--learning-rate", type=float, default=0.0005, show_default=True)
@click.option(
"-m", "--margin", type=float, default=15.0, show_default=True, help="for the margin loss and SLCWA training"
"-m",
"--margin",
type=float,
default=15.0,
show_default=True,
help="for the margin loss and SLCWA training",
)
@click.option(
"--num-negatives", type=int, default=4, show_default=True, help="negative samples per positive in the SLCWA regime"
"-n",
"--num-negatives",
type=int,
default=4,
show_default=True,
help="negative samples per positive in the SLCWA regime",
)
@click.option("-b", "--batch-size", type=int, default=256, show_default=True)
@click.option("-e", "--epochs", type=int, default=100, show_default=True, help="The number of training epochs")
@click.option(
"-e",
"--epochs",
type=int,
default=100,
show_default=True,
help="The number of training epochs",
)
@click.option("--wandb", is_flag=True, help="Track results with Weights & Biases")
@click.option("--save", is_flag=True, help=f"Save the model in the {DATA} directory")
@click.option("--gnn", is_flag=True, help="Use the Inductive NodePiece model with GCN layers")
@click.option(
"--gnn", is_flag=True, help="Use the Inductive NodePiece model with GCN layers"
)
def main(
    dataset: str,
    embedding_dim: int,
@@ -97,7 +133,10 @@ def main(
        model=model,
        optimizer=optimizer,
        result_tracker=tracker,
        negative_sampler_kwargs=dict(
            # affects training speed; the more negatives, the better the results
            num_negs_per_pos=num_negatives,
        ),
        mode=TRAINING,  # must be specified for the inductive setup
    )

@@ -126,18 +165,23 @@ def main(
        model=model,
        mapped_triples=dataset.inductive_testing.mapped_triples,
        additional_filter_triples=[
            # filtering of other positive triples
            dataset.inductive_inference.mapped_triples,
            dataset.inductive_validation.mapped_triples,
        ],
        batch_size=batch_size,
    )

    # extracting final metrics
    results_dict = result.to_dict()
    print(
        f"Test MRR {results_dict['inverse_harmonic_mean_rank']['both']['realistic']:.5f}"
    )
    for k in [100, 10, 5, 3, 1]:
        print(f"Test Hits@{k} {results_dict['hits_at_k']['both']['realistic'][k]:.5f}")
    print(
        f"Test Arithmetic Mean Rank {results_dict['arithmetic_mean_rank']['both']['realistic']:.5f}"
    )
    print(
        f"Test Adjusted Arithmetic Mean Rank {results_dict['adjusted_arithmetic_mean_rank']['both']['realistic']:.5f}"
    )
2 changes: 1 addition & 1 deletion tox.ini
@@ -4,6 +4,6 @@ deps =
    isort
skip_install = true
commands =
    black .
    isort --profile black .
description = Run linters.
