Commit cef3409

Add review guide (#905)

Authored by Mario Sänger and Leon Weber

* Initial version of the review guide
* Update REVIEW_GUIDE.md
* Fix illegal HF_ORG in hubtools.py

Co-authored-by: Mario Sänger <[email protected]>
Co-authored-by: Leon Weber <[email protected]>

1 parent 318f0c7 commit cef3409

File tree

5 files changed: +158 -17 lines changed


REVIEW_GUIDE.md

+87
# Review Guide

This guide explains the steps a project administrator needs to perform when updating a dataset
implementation.

## Checkout PR

First, check out the pull request to obtain a local copy using the [GitHub CLI](https://cli.github.com/):

```
gh pr checkout PULL-REQUEST
```
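
For example, for the pull request that introduced this guide, this would be `gh pr checkout 905`.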

## Basic checks

To ensure the highest possible standard and uniformity of the dataset implementations, please check the following requirements:

- The dataset should be implemented in `bigbio/hub/hub_repos/<dataset>` and contain (at least) the three default files `<dataset>.py`, `bigbiohub.py`, and `README.md`.
- Check whether all dataset metadata are given in `<dataset>.py` and `README.md` (see the sketch after this list). Refer to [BC5CDR](bigbio/hub/hub_repos/bc5cdr/) for an example of a complete set of information.
- The dataset should **not** import `bigbio` but instead use `.bigbiohub`.
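
As an illustration, the metadata check usually amounts to confirming that the module-level constants at the top of `<dataset>.py` are filled in. The following is a minimal sketch following the BigBio template; all values here are placeholders, and the exact set of constants for a real dataset is best checked against the BC5CDR example linked above:

```python
# Sketch of the module-level metadata expected in <dataset>.py.
# All values are placeholders; see bigbio/hub/hub_repos/bc5cdr/ for a real example.
_LOCAL = False  # True if the dataloader reads files from a local data_dir
_CITATION = """@article{placeholder, title={...}}"""
_DATASETNAME = "<dataset>"
_DESCRIPTION = """One-paragraph description of the dataset."""
_HOMEPAGE = "https://example.com/dataset-homepage"  # placeholder URL
_LICENSE = "License of the dataset"
_SOURCE_VERSION = "1.0.0"
_BIGBIO_VERSION = "1.0.0"
```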

## Run unit tests

Check whether the new or updated dataloader satisfies our unit tests by running the following command from the top level of the `biomedical` repo (i.e., the same directory that contains the `requirements.txt` file):

```bash
python -m tests.test_bigbio_hub <dataset_name> [--data_dir /path/to/local/data] --test_local
```

Note: you **MUST** include the `--test_local` flag to specifically test the script for your PR; otherwise, the script will default to downloading a dataloader script from the Hub. Your particular dataset may require some of the other command-line args in the test script (e.g., `--data_dir` for dataloaders that read local files).
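
For example, a local test of the BC5CDR dataloader (dataset name used for illustration) would be `python -m tests.test_bigbio_hub bc5cdr --test_local`.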

To view full usage instructions, you can use the `--help` flag:

```bash
python -m tests.test_bigbio_hub --help
```

This will explain the arguments you may need when testing. In brief:

- `dataset_name`: Name of the dataset you want to test
- `data_dir`: The location of the data for datasets where `_LOCAL = True`
- `config_name`: Name of the configuration you want to test. By default, the script tests all configs; use this to debug a specific config, or when your data is prohibitively large.
- `ishub`: Use this when unit testing scripts that are not yet uploaded to the hub (this is True for most cases)

If any (severe) errors occur, report these to the PR author.

## Merge the PR

If all previous checks pass, merge the PR into the main branch:

```
gh pr merge PULL-REQUEST
```
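
Run without further flags (e.g., `gh pr merge 905`), the GitHub CLI will prompt you interactively for the merge strategy.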

## Update hub dataset

After updating the GitHub repository, the dataset needs to be updated in the [BigBio Hugging Face datasets hub](https://huggingface.co/bigbio). For this, first create or retrieve an API access token for your Hugging Face account:

[https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

Note: create a token with **write** access:

![Screenshot HF access token](docs/_static/img/acess_token.png)

Run the following command from the top level of the repository to update the dataset in the hub:

```
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py <dataset>
```
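
If you are already authenticated via `huggingface-cli login`, the `huggingface_hub` client should fall back to your cached token, so the environment variable may be unnecessary; setting it explicitly, as shown here, works regardless.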

If the PR concerns a completely new dataset, add the option `-c` to create a new dataset repo in the hub first:

```
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py <dataset> -c
```

Moreover, you can test your upload command by first running a dry run using the option `-d`:

```
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py <dataset> -d
```

After running the command, visit the hub webpage of the dataset and check that the dataset card and the data viewer are displayed and the files are updated correctly, e.g.,
[https://huggingface.co/datasets/bigbio/bc5cdr](https://huggingface.co/datasets/bigbio/bc5cdr)
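
Besides eyeballing the webpage, the file listing can also be checked programmatically. A minimal sketch using the `huggingface_hub` client (not part of the guide itself; `bigbio/bc5cdr` used for illustration):

```python
from huggingface_hub import HfApi

# List the files currently in the hub dataset repo; after a successful upload
# we expect at least <dataset>.py, bigbiohub.py, and README.md.
api = HfApi()
files = api.list_repo_files("bigbio/bc5cdr", repo_type="dataset")
print("\n".join(sorted(files)))
```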

bigbio/hub/hubtools.py

+38
```diff
@@ -3,6 +3,10 @@
 """

 import os
+import subprocess
+
+from pathlib import Path
+
 from huggingface_hub import create_repo
 from huggingface_hub import HfApi

@@ -11,6 +15,10 @@
 HF_DATASETS_URL_BASE = "https://huggingface.co/datasets"


+def get_git_revision_short_hash() -> str:
+    return subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode("ascii").strip()
+
+
 def list_datasets(full=False):
     """List datasets
@@ -43,6 +51,36 @@ def create_repository(dataset_name, private=True):
     create_repo(repo_id, repo_type="dataset", private=private)


+def update_dataset(dataset_name: str, create_repo: bool = False, dryrun: bool = True):
+    local_dir = Path(f"bigbio/hub/hub_repos/{dataset_name}")
+    if not local_dir.exists():
+        raise AssertionError(f"Local directory {local_dir} doesn't exist")
+
+    repo_id = os.path.join(HF_ORG, dataset_name)
+
+    api = HfApi()
+    if create_repo:
+        if not dryrun:
+            print(f"Creating repository {repo_id}")
+            api.create_repo(repo_id, repo_type="dataset")
+        else:
+            print(f"DRYRUN: Creating repo {repo_id}")
+
+    git_hash = get_git_revision_short_hash()
+
+    if not dryrun:
+        print(f"Uploading {local_dir} to {repo_id}")
+        api.upload_folder(
+            folder_path=str(local_dir),
+            repo_id=repo_id,
+            repo_type="dataset",
+            commit_message=f"Update {dataset_name} based on git version {git_hash}",
+            commit_description=f"Update {dataset_name} based on git version {git_hash}",
+        )
+    else:
+        print(f"DRYRUN: Uploading local folder {local_dir} to {repo_id}")
+
+
 def upload_bigbiohub(dataset_names=None, dryrun=True):
     """Upload bigbiohub.py to one or more hub dataset repositories.
```
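
The new `update_dataset` helper can also be exercised directly from a Python shell before touching the hub. A minimal sketch, assuming you run it from `bigbio/hub/` (so `hubtools` is importable) and using a placeholder dataset name:

```python
import hubtools

# Dry run (the default): only prints what would be created and uploaded,
# without contacting the hub.
hubtools.update_dataset("bc5cdr", create_repo=False, dryrun=True)
```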

bigbio/hub/upload.py

+33
```python
"""
Upload a local hub dataset repo (bigbio/hub/hub_repos/<dataset>) to the hub.
"""
import argparse
import hubtools


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "dataset",
        type=str,
        help="Name of the dataset to be uploaded",
    )
    parser.add_argument(
        "-c", "--create",
        action="store_true",
        help="Set this flag if the repo should be created first (before uploading files)",
    )
    parser.add_argument(
        "-d", "--dryrun",
        action="store_true",
        help="Set this flag to test your command without uploading to the hub",
    )
    args = parser.parse_args()

    hubtools.update_dataset(
        dataset_name=args.dataset,
        create_repo=args.create,
        dryrun=args.dryrun
    )
```

bigbio/hub/upload_bigbiohub.py

-17
This file was deleted.

docs/_static/img/acess_token.png

130 KB
