
Error on resuming training after preemption #964

Open
eu9ene opened this issue Dec 17, 2024 · 4 comments

Labels: bug (Something is broken or not correct)

@eu9ene
Collaborator

eu9ene commented Dec 17, 2024

https://firefox-ci-tc.services.mozilla.com/tasks/Tiqin_UwSJSd8YDULbaV0g/runs/1/logs/public/logs/live.log

There are a bunch of things that happened here:

  1. It couldn't download the checkpoint:
[task 2024-12-17T16:14:21.902Z] INFO:root:Fetching public/build/model.npz...
[task 2024-12-17T16:14:22.133Z] ERROR:root:Got 4xx error for public/build/model.npz, run 1 is not resumable...
  2. It appears not to have installed the Python packages:
[task 2024-12-17T16:14:15.347Z] Requirement already satisfied: sentencepiece==0.1.99
...

How did it get those already installed? For example, in CI we can see the logs as they are being installed:

[task 2024-11-22T23:43:12.368Z] Collecting sentencepiece==0.1.99 (from -r ./checkouts/vcs/pipeline/train/requirements/train.txt (line 19))
[task 2024-11-22T23:43:12.385Z]   Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
[task 2024-11-22T23:43:22.363Z] Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[task 2024-11-22T23:43:23.925Z] Installing collected packages: typo, sentencepiece, tqdm, regex, pyyaml, joblib, click, sacremoses, opustrainer
  3. The initialization of OpusTrainer failed. This was not the case for the first attempt: the model was training fine, as we can see in train.log. Because of this, the whole task failed.
[task 2024-12-17T16:38:34.130Z] opustrainer.trainer.CurriculumLoaderError: could not initialize modifier 'Tags': PlaceholderTagModifier.__init__() got an unexpected keyword argument 'tag'
  4. Some errors related to parsing, likely due to the previous error:
[tracking ERROR] Publication failed! The error is ignored to not break training, but it should be fixed.
  5. We can't see the Taskcluster log for the first attempt. Should we reopen the issue?

@bhearsum I need your help to figure out the situation with the packages. I assume it somehow pulled the wrong package from the cache, and then OpusTrainer used the old version and couldn't use my fix that added the `tag` parameter.

eu9ene added the bug label on Dec 17, 2024
@bhearsum
Collaborator

I'm going to separate this slightly. First, a bit on the termination of run 0, and then about run 1 not resuming properly.

As a quick tl;dr - we definitely did not try to resume in run 1. run_command_pipeline appears to be failing quickly, presumably because of the opustrainer command. It's not clear to me why that's happening. model.npz.best-chrf.npz being missing appears to be fallout from this, not the original error.

Run 0

I was able to dig up worker logs for run 0. Most notably we have:

Dec 17 16:03:58Z translations-1-b-linux-v100-gpu-4-1tb-vvppjagzspgzcf58mfvmzw-1 start-worker 2024/12/17 16:03:58 GCP Metadata Service says termination is imminent
Dec 17 16:04:09Z translations-1-b-linux-v100-gpu-4-1tb-vvppjagzspgzcf58mfvmzw-1 systemd-logind Power key pressed.

So, this was most definitely a spot termination. And it looks to me like we didn't have time to upload all of the artifacts before we got the hard poweroff. (Otherwise the log and other artifacts would be available.) There may be two things influencing this:

  1. This happened to be an instance in Europe, and it's plausible that uploads took longer because of this.
  2. GCP appears to have given us much more limited time to handle the termination than is typical. ("The shutdown period for a preemption notice is best effort and up to 30 seconds." is what they advertise, but this is the first I've witnessed it to be anything less than 10 seconds.)

Run 1

As far as I can tell, run 1 did not try to resume, which is what I would expect in this situation. If it had, we would expect to see --pretrained-model in the arguments given to train.py.
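
For context, the resume decision boils down to something like the sketch below (simplified and illustrative, not the actual train_taskcluster.py code; the artifact URL and helper name are assumptions):

```python
import logging
import requests

def maybe_resume(task_id: str, prev_run_id: int, script_args: list[str]) -> list[str]:
    # Illustrative sketch: only pass --pretrained-model when the previous
    # run's checkpoint can actually be downloaded.
    url = (
        "https://firefox-ci-tc.services.mozilla.com/api/queue/v1"
        f"/task/{task_id}/runs/{prev_run_id}/artifacts/public/build/model.npz"
    )
    logging.info("Fetching public/build/model.npz...")
    response = requests.get(url, stream=True)
    if response.status_code >= 400:
        # This is the "Got 4xx error ... run 1 is not resumable" path from the
        # log: fall back to training from scratch instead of resuming.
        logging.error("Got 4xx error for public/build/model.npz, run is not resumable")
        return script_args
    with open("model.npz", "wb") as checkpoint:
        for chunk in response.iter_content(chunk_size=1 << 20):
            checkpoint.write(chunk)
    # Only when the checkpoint is present do we ask train.py to continue from it.
    return [*script_args, "--pretrained-model", "model.npz"]
```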

The full traceback that caused the task to fail is:

[task 2024-12-17T16:38:34.308Z] [tracking ERROR] Publication failed! The error is ignored to not break training, but it should be fixed.
[task 2024-12-17T16:38:34.308Z] Traceback (most recent call last):
[task 2024-12-17T16:38:34.308Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/translations_parser/parser.py", line 332, in parse_marian_context
[task 2024-12-17T16:38:34.308Z]     headers, text = next(logs_iter)
[task 2024-12-17T16:38:34.308Z] StopIteration
[task 2024-12-17T16:38:34.308Z] 
[task 2024-12-17T16:38:34.308Z] During handling of the above exception, another exception occurred:
[task 2024-12-17T16:38:34.308Z] 
[task 2024-12-17T16:38:34.308Z] Traceback (most recent call last):
[task 2024-12-17T16:38:34.308Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/translations_parser/cli/taskcluster.py", line 174, in main
[task 2024-12-17T16:38:34.308Z]     boot()
[task 2024-12-17T16:38:34.308Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/translations_parser/cli/taskcluster.py", line 165, in boot
[task 2024-12-17T16:38:34.308Z]     parser.run()
[task 2024-12-17T16:38:34.308Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/translations_parser/parser.py", line 449, in run
[task 2024-12-17T16:38:34.308Z]     self.parse()
[task 2024-12-17T16:38:34.308Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/translations_parser/parser.py", line 412, in parse
[task 2024-12-17T16:38:34.308Z]     self.parse_marian_context(copy)
[task 2024-12-17T16:38:34.308Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/translations_parser/parser.py", line 335, in parse_marian_context
[task 2024-12-17T16:38:34.308Z]     raise ValueError("Could not find a [marian] entry in the training log.")
[task 2024-12-17T16:38:34.308Z] ValueError: Could not find a [marian] entry in the training log.
[task 2024-12-17T16:38:36.270Z] Traceback (most recent call last):
[task 2024-12-17T16:38:36.270Z]   File "/home/ubuntu/tasks/task_173445166741057/checkouts/vcs/pipeline/train/train.py", line 475, in <module>
[task 2024-12-17T16:38:36.270Z]     main()
[task 2024-12-17T16:38:36.270Z]   File "/home/ubuntu/tasks/task_173445166741057/checkouts/vcs/pipeline/train/train.py", line 471, in main
[task 2024-12-17T16:38:36.270Z]     train_cli.run_training()
[task 2024-12-17T16:38:36.270Z]   File "/home/ubuntu/tasks/task_173445166741057/checkouts/vcs/pipeline/train/train.py", line 375, in run_training
[task 2024-12-17T16:38:36.271Z]     shutil.copy(
[task 2024-12-17T16:38:36.271Z]   File "/usr/lib/python3.10/shutil.py", line 417, in copy
[task 2024-12-17T16:38:36.297Z]     copyfile(src, dst, follow_symlinks=follow_symlinks)
[task 2024-12-17T16:38:36.297Z]   File "/usr/lib/python3.10/shutil.py", line 254, in copyfile
[task 2024-12-17T16:38:36.297Z]     with open(src, 'rb') as fsrc:
[task 2024-12-17T16:38:36.297Z] FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/tasks/task_173445166741057/artifacts/model.npz.best-chrf.npz'
[task 2024-12-17T16:38:36.388Z] Traceback (most recent call last):
[task 2024-12-17T16:38:36.388Z]   File "/home/ubuntu/tasks/task_173445166741057/./checkouts/vcs/taskcluster/scripts/pipeline/train_taskcluster.py", line 137, in <module>
[task 2024-12-17T16:38:36.388Z]     main(sys.argv[1:])
[task 2024-12-17T16:38:36.388Z]   File "/home/ubuntu/tasks/task_173445166741057/./checkouts/vcs/taskcluster/scripts/pipeline/train_taskcluster.py", line 133, in main
[task 2024-12-17T16:38:36.388Z]     subprocess.run([TRAINING_SCRIPT, *script_args], check=True)
[task 2024-12-17T16:38:36.388Z]   File "/usr/lib/python3.10/subprocess.py", line 526, in run
[task 2024-12-17T16:38:36.422Z]     raise CalledProcessError(retcode, process.args,
[task 2024-12-17T16:38:36.422Z] subprocess.CalledProcessError: Command '['/home/ubuntu/tasks/task_173445166741057/./checkouts/vcs/taskcluster/scripts/pipeline/train-taskcluster.sh', 'teacher', 'train', 'zh', 'en', '/home/ubuntu/tasks/task_173445166741057/fetches/corpus.tok-icu,/home/ubuntu/tasks/task_173445166741057/fetches/mono.tok-icu', '/home/ubuntu/tasks/task_173445166741057/fetches/devset', '/home/ubuntu/tasks/task_173445166741057/artifacts', 'chrf', '/home/ubuntu/tasks/task_173445166741057/fetches/corpus.aln.zst,/home/ubuntu/tasks/task_173445166741057/fetches/mono.aln.zst', '1', 'two-stage', 'None', 'None', 'None', '--early-stopping', '20']' returned non-zero exit status 1.

The first half is about W&B reporting, and appears not to be fatal based on the "The error is ignored..." message. The second traceback (that starts on line 475 of train.py) appears to be the fatal part. Specifically:

[task 2024-12-17T16:38:36.297Z] FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/tasks/task_173445166741057/artifacts/model.npz.best-chrf.npz'

That comes from this piece of code, which looks like it is supposed to run after opustrainer and marian have run.
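
In other words, the fatal step is roughly the following (a simplified sketch of the copy in run_training; the function name and paths are illustrative):

```python
import shutil

def export_best_checkpoint(model_dir: str, artifacts_dir: str) -> None:
    # Sketch only: Marian writes model.npz.best-chrf.npz once validation has
    # produced a best checkpoint. In run 1, OpusTrainer failed to initialize,
    # so Marian never trained, the file was never written, and this copy
    # raises FileNotFoundError.
    shutil.copy(
        f"{model_dir}/model.npz.best-chrf.npz",
        f"{artifacts_dir}/model.npz.best-chrf.npz",
    )
```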

Specific questions

To reply to your specific questions as well:

1. It couldn't download the checkpoint:
[task 2024-12-17T16:14:21.902Z] INFO:root:Fetching public/build/model.npz...
[task 2024-12-17T16:14:22.133Z] ERROR:root:Got 4xx error for public/build/model.npz, run 1 is not resumable...

Covered above.

2. It appears not to have installed the Python packages:
[task 2024-12-17T16:14:15.347Z] Requirement already satisfied: sentencepiece==0.1.99
...

How did it get those already installed? For example, in CI we can see the logs as they are being installed:

[task 2024-11-22T23:43:12.368Z] Collecting sentencepiece==0.1.99 (from -r ./checkouts/vcs/pipeline/train/requirements/train.txt (line 19))
[task 2024-11-22T23:43:12.385Z]   Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
[task 2024-11-22T23:43:22.363Z] Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[task 2024-11-22T23:43:23.925Z] Installing collected packages: typo, sentencepiece, tqdm, regex, pyyaml, joblib, click, sacremoses, opustrainer

This just means the package was already there; this worker previously ran tasks, so it had things like this cached already:
[screenshot of the cached packages on the worker]

3. The initialization of OpusTrainer failed. This was not the case for the first attempt: the model was training fine, as we can see in train.log. Because of this, the whole task failed.
[task 2024-12-17T16:38:34.130Z] opustrainer.trainer.CurriculumLoaderError: could not initialize modifier 'Tags': PlaceholderTagModifier.__init__() got an unexpected keyword argument 'tag'
4. Some errors related to parsing, likely due to the previous error:
[tracking ERROR] Publication failed! The error is ignored to not break training, but it should be fixed.

I'm a bit suspicious that W&B exiting is causing the entire pipeline to fail. This seems to be run through run_command_pipeline; it's not immediately clear to me whether the W&B parser exiting will cause this to return.
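
If run_command_pipeline wires the stages together with OS pipes, the failure mode I'm suspicious of would look roughly like this (purely illustrative, not the actual helper):

```python
import subprocess

def run_command_pipeline_sketch(commands: list[list[str]]) -> int:
    # Illustrative only: chain commands like `opustrainer ... | marian ... | parser`
    # in a shell. If the final consumer (the W&B parser) exits early, the
    # upstream stages hit SIGPIPE/EPIPE on their next write, so the whole
    # pipeline can come down even though the parser "ignored" its own error.
    procs = []
    prev_stdout = None
    for cmd in commands:
        proc = subprocess.Popen(cmd, stdin=prev_stdout, stdout=subprocess.PIPE)
        if prev_stdout is not None:
            prev_stdout.close()  # allow SIGPIPE to reach the producer
        prev_stdout = proc.stdout
        procs.append(proc)
    procs[-1].stdout.read()  # drain the last stage
    returncodes = [proc.wait() for proc in procs]
    return next((rc for rc in returncodes if rc != 0), 0)
```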

5. We can't see the [Taskcluster log](https://firefox-ci-tc.services.mozilla.com/tasks/Tiqin_UwSJSd8YDULbaV0g/runs/0/logs/public/logs/live.log) for the first attempt. Should we reopen the issue?

Covered above.

@eu9ene
Collaborator Author

eu9ene commented Dec 18, 2024

Thank you for clarifying things! The mystery for me here is still the OpusTrainer error. Do you see the same error in the log for run 0? If not, then something really is messing up the package versions. I wonder if your new Docker images will solve this.

@bhearsum
Collaborator

Thank you for clarifying things! The mystery for me here is still the OpusTrainer error. Do you see the same error in the log for run 0? If not, then something really is messing up the package versions. I wonder if your new Docker images will solve this.

Unfortunately, the log from run 0 is unrecoverable :(. The things I pasted above were from syslog, which is preserved.

@bhearsum
Collaborator

We dug more into this on Matrix. What's going on here is that the task that failed used an opustrainer wheel from a previous task, which was built from a different revision:

[task 2024-12-17T15:49:32.122Z] Collecting opustrainer@ git+https://github.com/hplt-project/OpusTrainer.git@c966d7b353d6b3c6a09d9573f1ab6ba3221c1d21 (from -r ./checkouts/vcs/pipeline/train/requirements/train.txt (line 11))
[task 2024-12-17T15:49:50.039Z] Created wheel for opustrainer: filename=opustrainer-0.2-py3-none-any.whl size=43619 sha256=afeb9bd0ca2464da9e9cc80521e3f43bf6ae843c86f7235e3ab0ebf2911e35f1

(From https://firefox-ci-tc.services.mozilla.com/tasks/IKcqat0UQ6yVUBYJBDDbcA)

This happens because we're pinning opustrainer to a revision, but the wheel gets built with only a version number in its filename. When the subsequent task ran, it saw an opustrainer 0.2 wheel and simply used that rather than rebuilding.

This sucks, and sounds like a pip bug.

One thing this does mean is that spot termination was not a factor here.

A few suggestions as to how to proceed:

  • Bump the opustrainer version when you make changes to it (see the sketch below)
  • See if pip has a way to include the rev in the wheel name
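
To illustrate the first suggestion: from pip's point of view, the stale wheel and a fresh build from the currently pinned revision are both just "opustrainer 0.2", so nothing in the version metadata distinguishes them. A quick illustrative check:

```python
from importlib import metadata

# Both the stale wheel (built from the old revision) and a fresh build from the
# pinned revision report the same version, so pip sees no difference between them:
print(metadata.version("opustrainer"))  # -> "0.2" in both cases

# Bumping the version in OpusTrainer's packaging metadata changes the wheel
# filename (e.g. opustrainer-0.3-...), which is what actually forces a rebuild.
```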
