Rewrite train.sh to train.py (#842)
* Add a run_pipeline utility

* Add more tests for training

* Rewrite train.sh into train.py

* Add the pipeline to the PYTHONPATH

* Ensure that the W&B tracker throws errors in CI

* Add the Taskcluster environment variables so test-fast works on the train test

* Address review comments
gregtatum authored Sep 18, 2024
1 parent d7235e0 commit 9d355d8
Showing 20 changed files with 729 additions and 216 deletions.
3 changes: 1 addition & 2 deletions docs/opus-trainer.md
@@ -60,7 +60,7 @@ It likely will be the case when using a pre-trained student model as a backward
OpusTrainer configuration files for the trained models are located in
the [/pipeline/train/configs/opustrainer/](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/opustrainer/) directory.
- `<dataset0>`, `<dataset1>` and `<vocab>` will be replaced by the training datasets and a path to Sentencepiece `vocab.spm` passed in `pipeline/train/train.sh` script.
+ `{dataset0}`, `{dataset1}` and `{vocab}` will be replaced by the training datasets and a path to Sentencepiece `vocab.spm` passed in `pipeline/train/train.py` script.

See more details on configuration in the OpusTrainer [readme](https://github.com/hplt-project/OpusTrainer).
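For context, the new curly-brace placeholders are ordinary Python format fields, so the training script can fill them with standard string formatting. A minimal sketch of that substitution, assuming plain `str.format` (the actual code in `train.py` is not shown in this diff and may differ):

```python
# Illustrative only: the real substitution lives in pipeline/train/train.py.
# The template snippet and paths below are hypothetical.
template = (
    "datasets:\n"
    "  original: {dataset0}  # Original parallel corpus\n"
    "spm_vocab: {vocab}\n"
)
rendered = template.format(
    dataset0="/data/corpus.tsv",
    vocab="/data/vocab.spm",
)
print(rendered)
```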

@@ -167,4 +167,3 @@ so it should only be used on small evaluation datasets.
- flores_aug-noise_devtest
- flores_aug-inline-noise_devtest
```

8 changes: 4 additions & 4 deletions docs/training-guide.md
@@ -139,10 +139,10 @@ For more details on data cleaning see the documents on [Data cleaning](cleaning.
## 4. Set hyperparameters

The pipeline supports overriding the default [Marian settings](https://marian-nmt.github.io/docs/cmd/marian/) in the training config. The default settings are in the `pipeline/train/configs` directory,
- for example [`teacher.train.yml`] and in the [`train.sh`] script.
+ for example [`teacher.train.yml`] and in the [`train.py`] script.

[teacher.train.yml]: https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml
- [train.sh]: https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh
+ [train.py]: https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.py
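To make the override mechanism concrete, here is a hedged sketch of how a dictionary of Marian settings can be expanded into command-line flags. It mirrors the `apply_command_args()` helper added by this commit in `pipeline/common/command_runner.py` (shown further down); the flag names and values here are illustrative, not the pipeline's defaults:

```python
# Hypothetical overrides; the real defaults live in pipeline/train/configs/.
marian_overrides = {
    "learn-rate": 0.0003,
    "mini-batch-fit": None,  # a bare flag with no value
    "valid-metrics": ["ce-mean-words", "bleu-detok"],
}

# Same expansion rule as apply_command_args() in pipeline/common/command_runner.py:
# every key becomes a --flag, and its value (or list of values) follows it.
flags = []
for key, value in marian_overrides.items():
    flags.append(f"--{key}")
    if value is None:
        continue
    if isinstance(value, list):
        flags.extend(str(v) for v in value)
    else:
        flags.append(str(value))

print(flags)
# ['--learn-rate', '0.0003', '--mini-batch-fit',
#  '--valid-metrics', 'ce-mean-words', 'bleu-detok']
```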

### Model training

@@ -224,7 +224,7 @@ Find the full description of the pipeline steps [here](pipeline-steps.md).
### Cluster specific configuration

The Marian workspace is usually safe to set to about 3/4 of available GPU memory
- (in a [profile for Snakemake](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh) and throughout the ci steps in Task cluster).
+ (in a [profile for Snakemake](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.py) and throughout the ci steps in Task cluster).
Setting a higher value speeds up training but might lead to out of GPU memory error.

### Taskcluster
@@ -319,7 +319,7 @@ Taskcluster retries automatically.

Usually, by the time we train the student, it's so much data that it might not fit in 128 GB of RAM.
For very high-resource languages like French it can happen even earlier, on the backward/teacher training stage.
- The workaround is to remove `--shuffle-in-ram` from the [training script](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh)
+ The workaround is to remove `--shuffle-in-ram` from the [training script](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.py)
and add `--shuffle batches` instead.
More details in the [issue](https://github.com/mozilla/firefox-translations-training/issues/21).

96 changes: 96 additions & 0 deletions pipeline/common/command_runner.py
@@ -0,0 +1,96 @@
import re
from shlex import join
import shlex
import subprocess


def _get_indented_command_string(command_parts: list[str]) -> str:
    """
    Print out a command with the flags indented, so that it's easy to read.
    """
    command = join(command_parts)
    parts = re.split(r"( --\w)", command)

    formatted_command = [parts[0].strip()]

    for i in range(1, len(parts), 2):
        option = parts[i].strip() + parts[i + 1].strip()
        formatted_command.append(f" {option}")

    return "\n".join(formatted_command)


def apply_command_args(dict: dict[str, any]):
    """
    Takes in a dictionary, and applies the keys as command line flags.

    input: { "key": "value" }
    output: "--key value"

    input: { "inputs": ["valueA", "valueB"] }
    output: "--inputs valueA valueB"
    """

    for key, value in dict.items():
        yield f"--{key}"
        if value is None:
            continue

        if isinstance(value, list):
            for v in value:
                yield str(v)
            continue

        yield str(value)


def run_command_pipeline(
    commands: list[list[str]], pipe_stderr=False, capture=False, logger=None
) -> str | None:
    """
    Executes a series of shell commands in a pipeline, where the output of one command
    is piped to the next. Optionally captures the final output or logs the pipeline
    process. It raises `CalledProcessError` if any command in the pipeline fails.

    Args:
        commands: A list of command arguments where each command is
            represented as a list of strings.
        pipe_stderr: If True, pipes `stderr` of each command into `stdout`.
        capture: If True, captures and returns the output of the final command in the
            pipeline. If False, output is printed to stdout. Defaults to False.
        logger: A logger instance used for logging the command execution. If provided,
            it will log the constructed pipeline commands. Defaults to None.

    Example:
        python_scripts = run_pipeline(
            [
                ["ls", "-l"],
                ["grep", ".py"],
                ["sort"]
            ],
            capture=True
        )
    """
    if pipe_stderr:
        joiner = "2>&1 |"
    else:
        joiner = "|"

    if logger:
        # Log out a nice representation of this command.
        final_command = _get_indented_command_string(commands[0])
        for command_parts in commands[1:]:
            final_command = (
                f"{final_command}\n{joiner} {_get_indented_command_string(command_parts)}"
            )

        logger.info("Running:")
        for line in final_command.split("\n"):
            logger.info(line)

    command_string = f" {joiner} ".join([shlex.join(command) for command in commands])

    if capture:
        return subprocess.check_output(command_string, shell=True).decode("utf-8")

    subprocess.check_call(command_string, shell=True)
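For reference, a short usage sketch of the two helpers above. It is not part of the committed file, and the import path assumes the repository root is on `PYTHONPATH`:

```python
# Not part of the commit; commands and paths are illustrative.
from pipeline.common.command_runner import apply_command_args, run_command_pipeline

# Expand a dict of flags into argv-style parts.
marian_args = list(apply_command_args({"model": "model.npz", "devices": [0, 1]}))
# -> ['--model', 'model.npz', '--devices', '0', '1']

# Pipe several commands together and capture the final output.
python_files = run_command_pipeline(
    [
        ["ls", "-l"],
        ["grep", ".py"],
        ["sort"],
    ],
    capture=True,
)
print(python_files)
```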
2 changes: 1 addition & 1 deletion pipeline/train/configs/opustrainer/backward.yml
@@ -1,5 +1,5 @@
datasets:
- original: <dataset0> # Original parallel corpus
+ original: {dataset0} # Original parallel corpus

stages:
- train
4 changes: 2 additions & 2 deletions pipeline/train/configs/opustrainer/student.yml
@@ -1,5 +1,5 @@
datasets:
- original: <dataset0> # Original parallel corpus
+ original: {dataset0} # Original parallel corpus

stages:
- train
@@ -26,7 +26,7 @@ modifiers:
# Tags modifier has to be the last one to retokenize the alignments
- Tags: 0.005
augment: 1
- spm_vocab: <vocab>
+ spm_vocab: {vocab}

seed: 1111
# parallel sentences + token alignments
6 changes: 3 additions & 3 deletions pipeline/train/configs/opustrainer/teacher.one-stage.yml
@@ -1,6 +1,6 @@
datasets:
- original: <dataset0> # Original parallel corpus
- backtranslated: <dataset1> # Back-translated data
+ original: {dataset0} # Original parallel corpus
+ backtranslated: {dataset1} # Back-translated data

stages:
- train
@@ -34,6 +34,6 @@ modifiers:


# random seed should be different for different teacher models
- seed: <seed>
+ seed: {seed}
# parallel sentences + token alignments
num_fields: 3
6 changes: 3 additions & 3 deletions pipeline/train/configs/opustrainer/teacher.two-stage.yml
@@ -1,6 +1,6 @@
datasets:
- original: <dataset0> # Original parallel corpus
- backtranslated: <dataset1> # Back-translated data
+ original: {dataset0} # Original parallel corpus
+ backtranslated: {dataset1} # Back-translated data

stages:
- pretrain
@@ -39,6 +39,6 @@ modifiers:


# random seed should be different for different teacher models
- seed: <seed>
+ seed: {seed}
# parallel sentences + token alignments
num_fields: 3
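The `{seed}` placeholder is filled in by the training script; how the per-teacher values are chosen is not shown in this diff. A purely hypothetical way to derive distinct, reproducible seeds for an ensemble of teachers:

```python
# Hypothetical only: the pipeline's actual seed selection is not shown in this commit.
BASE_SEED = 1111

def teacher_seed(teacher_index: int) -> int:
    """Give each teacher model its own reproducible random seed."""
    return BASE_SEED + teacher_index

print([teacher_seed(i) for i in range(2)])  # e.g. two teachers -> [1111, 1112]
```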