# Review Guide

This guide explains the steps a project administrator needs to perform when updating a dataset
implementation.

## Checkout PR
First, check out the pull request to obtain a local copy using the [GitHub CLI](https://cli.github.com/):
```
gh pr checkout PULL-REQUEST
```

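For example, with a hypothetical pull request number (`gh pr checkout` also accepts a PR URL or head branch name):
```
gh pr checkout 1234
```
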
## Basic checks

To ensure the highest possible standard and uniformity of the dataset implementations, please check the following dataset requirements (a quick command-line spot check is sketched below the list).

- The dataset should be implemented in `bigbio/hub/hub_repos/<dataset>` and contain (at least) the three default
files `<dataset>.py`, `bigbiohub.py` and `README.md`.
- Check whether all dataset metadata are given in `<dataset>.py` and `README.md`. Refer to
[BC5CDR](bigbio/hub/hub_repos/bc5cdr/) for an example of a complete set of information.
- The dataset should **not** import `bigbio` but instead use the local `.bigbiohub` module.

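As a rough sketch, the following commands can help with this spot check (the `<dataset>` name is a placeholder to substitute):

```bash
# List the files of the dataset implementation (should include <dataset>.py, bigbiohub.py, README.md)
ls bigbio/hub/hub_repos/<dataset>/
# Flag any direct `bigbio` imports; only relative imports from `.bigbiohub` should appear
grep -nE "^(from|import) bigbio" bigbio/hub/hub_repos/<dataset>/<dataset>.py
```
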
## Run unit tests

Check that the new or updated dataloader satisfies our unit tests by running the following command
from the top level of the `biomedical` repo (i.e. the same directory that contains the `requirements.txt` file):

```bash
python -m tests.test_bigbio_hub <dataset_name> [--data_dir /path/to/local/data] --test_local
```

Note that you **MUST** include the `--test_local` flag to specifically test the script for your PR; otherwise
the script will default to downloading a dataloader script from the Hub. Your particular dataset may
require some of the other command-line args of the test script (e.g. `--data_dir` for dataloaders
that read local files).

To view the full usage instructions, use the `--help` flag:

```bash
python -m tests.test_bigbio_hub --help
```
This will explain the types of arguments you may need to pass. A brief annotation of each:

- `dataset_name`: Name of the dataset you want to test
- `data_dir`: The location of the data for datasets where `_LOCAL = True`
- `config_name`: Name of the configuration you want to test. By default, the script tests all configs, but you can use this to debug a specific config, or if your data is prohibitively large (see the example below).
- `ishub`: Use this when unit testing scripts that are not yet uploaded to the hub (this is True for most cases)

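For example, a sketch of testing only a single configuration using the `config_name` argument listed above, assuming a KB-schema dataset whose config names follow the usual `<dataset_name>_<schema>` convention (all names are placeholders):

```bash
python -m tests.test_bigbio_hub <dataset_name> --config_name <dataset_name>_bigbio_kb --test_local
```
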
If any (severe) errors occur, report these to the PR author.

## Merge the PR
If all previous checks were completed successfully, merge the PR into the main branch:
```
gh pr merge PULL-REQUEST
```

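For example, with a hypothetical pull request number (`gh pr merge` also accepts `--merge`, `--squash`, or `--rebase` to pick the merge strategy):
```
gh pr merge 1234 --squash
```
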
## Update hub dataset
After updating the GitHub repository, the dataset needs to be updated in the [BigBio Hugging Face
datasets hub](https://huggingface.co/bigbio). For this, first create or retrieve an API access token
for your Hugging Face account:

[https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

Note: make sure to create a token with **write** access.

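Optionally, you can export the token once as an environment variable instead of prefixing each of the commands below (the value is a placeholder):
```
export HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN>
```
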
Run the following command from the top level of the repository to update the dataset in the hub:
```
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py <dataset>
```

If the PR adds a completely new dataset, add the option `-c` to first create a new dataset
repo in the hub:
```
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py <dataset> -c
```

Moreover, you can test your upload command by first performing a dry run using the option `-d`:
```
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py <dataset> -d
```

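For example, a dry run for the `bc5cdr` dataset referenced above (the token value is a placeholder):
```
HUGGING_FACE_HUB_TOKEN=<ACCESS-TOKEN> python bigbio/hub/upload.py bc5cdr -d
```
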
After running the command, visit the hub page of the dataset and check that the dataset card and the data
viewer are displayed and that the files were updated correctly, e.g.
[https://huggingface.co/datasets/bigbio/bc5cdr](https://huggingface.co/datasets/bigbio/bc5cdr)