Commit cef3409

Add review guide (#905)

Authored by Mario Sänger and Leon Weber

* Initial version of the review guide
* Update REVIEW_GUIDE.md
* Fix illegal HF_ORG in hubtools.py

Co-authored-by: Mario Sänger <[email protected]>
Co-authored-by: Leon Weber <[email protected]>

1 parent 318f0c7 commit cef3409

File tree

5 files changed: +158 -17 lines changed


REVIEW_GUIDE.md

+87
# Review Guide

This guide explains the steps a project administrator needs to perform when updating a dataset
implementation.

## Checkout PR

First, check out the pull request to obtain a local copy using the [GitHub CLI](https://cli.github.com/):

```
gh pr checkout PULL-REQUEST
```
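
For example, for the pull request that introduced this guide, this would be `gh pr checkout 905`.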

## Basic checks

To ensure the highest possible standard and uniformity of the dataset implementations, please check the following requirements:

- The dataset should be implemented in `bigbio/hub/hub_repos/<dataset>` and contain (at least) the three default files `<dataset>.py`, `bigbiohub.py`, and `README.md`.
- Check whether all dataset metadata are given in `<dataset>.py` and `README.md` (see the sketch after this list). Refer to [BC5CDR](bigbio/hub/hub_repos/bc5cdr/) for an example of a complete set of information.
- The dataset should **not** import `bigbio` but instead use `.bigbiohub`.
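
As an illustration, the metadata check usually amounts to confirming that the module-level constants at the top of `<dataset>.py` are filled in. The following is a minimal sketch following the BigBio template; all values here are placeholders, and the exact set of constants for a real dataset is best checked against the BC5CDR example linked above:

```python
# Sketch of the module-level metadata expected in <dataset>.py.
# All values are placeholders; see bigbio/hub/hub_repos/bc5cdr/ for a real example.
_LOCAL = False  # True if the dataloader reads files from a local data_dir
_CITATION = """@article{placeholder, title={...}}"""
_DATASETNAME = "<dataset>"
_DESCRIPTION = """One-paragraph description of the dataset."""
_HOMEPAGE = "https://example.com/dataset-homepage"  # placeholder URL
_LICENSE = "License of the dataset"
_SOURCE_VERSION = "1.0.0"
_BIGBIO_VERSION = "1.0.0"
```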

## Run unit tests

Check whether the new or updated dataloader satisfies our unit tests by running the following command from the top level of the `biomedical` repo (i.e., the same directory that contains the `requirements.txt` file):

```bash
python -m tests.test_bigbio_hub <dataset_name> [--data_dir /path/to/local/data] --test_local
```

Note: you **MUST** include the `--test_local` flag to specifically test the script for your PR; otherwise, the script will default to downloading a dataloader script from the Hub. Your particular dataset may require some of the other command-line args in the test script (e.g., `--data_dir` for dataloaders that read local files).
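
For example, a local test of the BC5CDR dataloader (dataset name used for illustration) would be `python -m tests.test_bigbio_hub bc5cdr --test_local`.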

To view full usage instructions, you can use the `--help` flag:

```bash
python -m tests.test_bigbio_hub --help
```

This will explain the arguments you may need when testing. In brief:

- `dataset_name`: Name of the dataset you want to test
- `data_dir`: The location of the data for datasets where `_LOCAL = True`
- `config_name`: Name of the configuration you want to test. By default, the script tests all configs; use this to debug a specific config, or when your data is prohibitively large.
- `ishub`: Use this when unit testing scripts that are not yet uploaded to the hub (this is True for most cases)

If any (severe) errors occur, report these to the PR author.

## Merge the PR

If all previous checks pass, merge the PR into the main branch:

```
gh pr merge PULL-REQUEST
```
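
Run without further flags (e.g., `gh pr merge 905`), the GitHub CLI will prompt you interactively for the merge strategy.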

## Update hub dataset

After updating the GitHub repository, the dataset needs to be updated in the [BigBio Hugging Face datasets hub](https://huggingface.co/bigbio). For this, first create or retrieve an API access token for your Hugging Face account:

[https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

Note: create a token with **write** access:

![Screenshot HF access token](docs/_static/img/acess_token.png)

Run the following command from the top level of the repository to update the dataset in the hub:

```
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py <dataset>
```
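
If you are already authenticated via `huggingface-cli login`, the `huggingface_hub` client should fall back to your cached token, so the environment variable may be unnecessary; setting it explicitly, as shown here, works regardless.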

If the PR concerns a completely new dataset, add the option `-c` to create a new dataset repo in the hub first:

```
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py <dataset> -c
```

Moreover, you can test your upload command by first running a dry run using the option `-d`:

```
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py <dataset> -d
```

After running the command, visit the hub webpage of the dataset and check that the dataset card and the data viewer are displayed and the files are updated correctly, e.g.,
[https://huggingface.co/datasets/bigbio/bc5cdr](https://huggingface.co/datasets/bigbio/bc5cdr)
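
Besides eyeballing the webpage, the file listing can also be checked programmatically. A minimal sketch using the `huggingface_hub` client (not part of the guide itself; `bigbio/bc5cdr` used for illustration):

```python
from huggingface_hub import HfApi

# List the files currently in the hub dataset repo; after a successful upload
# we expect at least <dataset>.py, bigbiohub.py, and README.md.
api = HfApi()
files = api.list_repo_files("bigbio/bc5cdr", repo_type="dataset")
print("\n".join(sorted(files)))
```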

bigbio/hub/hubtools.py

+38
```diff
@@ -3,6 +3,10 @@
 """

 import os
+import subprocess
+
+from pathlib import Path
+
 from huggingface_hub import create_repo
 from huggingface_hub import HfApi

@@ -11,6 +15,10 @@
 HF_DATASETS_URL_BASE = "https://huggingface.co/datasets"


+def get_git_revision_short_hash() -> str:
+    return subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode("ascii").strip()
+
+
 def list_datasets(full=False):
     """List datasets
@@ -43,6 +51,36 @@ def create_repository(dataset_name, private=True):
     create_repo(repo_id, repo_type="dataset", private=private)


+def update_dataset(dataset_name: str, create_repo: bool = False, dryrun: bool = True):
+    local_dir = Path(f"bigbio/hub/hub_repos/{dataset_name}")
+    if not local_dir.exists():
+        raise AssertionError(f"Local directory {local_dir} doesn't exist")
+
+    repo_id = os.path.join(HF_ORG, dataset_name)
+
+    api = HfApi()
+    if create_repo:
+        if not dryrun:
+            print(f"Creating repository {repo_id}")
+            api.create_repo(repo_id, repo_type="dataset")
+        else:
+            print(f"DRYRUN: Creating repo {repo_id}")
+
+    git_hash = get_git_revision_short_hash()
+
+    if not dryrun:
+        print(f"Uploading {local_dir} to {repo_id}")
+        api.upload_folder(
+            folder_path=str(local_dir),
+            repo_id=repo_id,
+            repo_type="dataset",
+            commit_message=f"Update {dataset_name} based on git version {git_hash}",
+            commit_description=f"Update {dataset_name} based on git version {git_hash}",
+        )
+    else:
+        print(f"DRYRUN: Uploading local folder {local_dir} to {repo_id}")
+
+
 def upload_bigbiohub(dataset_names=None, dryrun=True):
     """Upload bigbiohub.py to one or more hub dataset repositories.
```
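
The new `update_dataset` helper can also be exercised directly from a Python shell before touching the hub. A minimal sketch, assuming you run it from `bigbio/hub/` (so `hubtools` is importable) and using a placeholder dataset name:

```python
import hubtools

# Dry run (the default): only prints what would be created and uploaded,
# without contacting the hub.
hubtools.update_dataset("bc5cdr", create_repo=False, dryrun=True)
```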

bigbio/hub/upload.py

+33
```python
"""
Upload a local hub dataset repo (bigbio/hub/hub_repos/<dataset>) to the hub.
"""
import argparse
import hubtools


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "dataset",
        type=str,
        help="Name of the dataset to be uploaded",
    )
    parser.add_argument(
        "-c", "--create",
        action="store_true",
        help="Set this flag if the repo should be created first (before uploading files)",
    )
    parser.add_argument(
        "-d", "--dryrun",
        action="store_true",
        help="Set this flag to test your command without uploading to the hub",
    )
    args = parser.parse_args()

    hubtools.update_dataset(
        dataset_name=args.dataset,
        create_repo=args.create,
        dryrun=args.dryrun
    )
```

bigbio/hub/upload_bigbiohub.py

-17
This file was deleted.

docs/_static/img/acess_token.png

130 KB
