
Error on resuming training after preemption #964

Open
eu9ene opened this issue Dec 17, 2024 · 4 comments

Labels: bug (Something is broken or not correct)

@eu9ene
Collaborator

eu9ene commented Dec 17, 2024

https://firefox-ci-tc.services.mozilla.com/tasks/Tiqin_UwSJSd8YDULbaV0g/runs/1/logs/public/logs/live.log

There are a bunch of things that happened here:

  1. It couldn't download the checkpoint:
[task 2024-12-17T16:14:21.902Z] INFO:root:Fetching public/build/model.npz...
[task 2024-12-17T16:14:22.133Z] ERROR:root:Got 4xx error for public/build/model.npz, run 1 is not resumable...
  2. It appears not to have installed the Python packages:
[task 2024-12-17T16:14:15.347Z] Requirement already satisfied: sentencepiece==0.1.99
...

How did it get those already installed? For example, in CI we can see the logs as they are being installed:

[task 2024-11-22T23:43:12.368Z] Collecting sentencepiece==0.1.99 (from -r ./checkouts/vcs/pipeline/train/requirements/train.txt (line 19))
[task 2024-11-22T23:43:12.385Z]   Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
[task 2024-11-22T23:43:22.363Z] Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[task 2024-11-22T23:43:23.925Z] Installing collected packages: typo, sentencepiece, tqdm, regex, pyyaml, joblib, click, sacremoses, opustrainer
  3. The initialization of OpusTrainer failed. This was not the case for the first attempt: the model was training fine, as we can see in train.log. Because of this, the whole task failed.
[task 2024-12-17T16:38:34.130Z] opustrainer.trainer.CurriculumLoaderError: could not initialize modifier 'Tags': PlaceholderTagModifier.__init__() got an unexpected keyword argument 'tag'
  4. Some errors related to parsing, likely due to the previous error:
[tracking ERROR] Publication failed! The error is ignored to not break training, but it should be fixed.
  5. We can't see the Taskcluster log for the first attempt. Should we reopen the issue?

@bhearsum I need your help to figure out the situation with the packages. I assume it somehow pulled the wrong package from the cache, and then OpusTrainer used the old version and couldn't use my fix that added the `tag` parameter.

eu9ene added the bug label on Dec 17, 2024
@bhearsum
Collaborator

I'm going to separate this slightly. First, a bit on the termination of run 0, and then about run 1 not resuming properly.

As a quick tl;dr - we definitely did not try to resume in run 1. run_command_pipeline appears to be failing quickly, presumably because of the opustrainer command. It's not clear to me why that's happening. model.npz.best-chrf.npz being missing appears to be fallout from this, not the original error.

Run 0

I was able to dig up worker logs for run 0. Most notably we have:

Dec 17 16:03:58Z translations-1-b-linux-v100-gpu-4-1tb-vvppjagzspgzcf58mfvmzw-1 start-worker 2024/12/17 16:03:58 GCP Metadata Service says termination is imminent
Dec 17 16:04:09Z translations-1-b-linux-v100-gpu-4-1tb-vvppjagzspgzcf58mfvmzw-1 systemd-logind Power key pressed.

So, this was most definitely a spot termination. And it looks to me like we didn't have time to upload all of the artifacts before we got the hard poweroff. (Otherwise the log and other artifacts would be available.) There may be two things influencing this:

  1. This happened to be an instance in Europe, and it's plausible that uploads took longer because of this.
  2. GCP appears to have given us much more limited time to handle the termination than is typical. ("The shutdown period for a preemption notice is best effort and up to 30 seconds." is what they advertise, but this is the first I've witnessed it to be anything less than 10 seconds.)

Run 1

As far as I can tell, run 1 did not try to resume, which is what I would expect in this situation. If it had, we would expect to see --pretrained-model in the arguments given to train.py.
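
For context, the resume decision boils down to something like the sketch below (simplified and illustrative, not the actual train_taskcluster.py code; the artifact URL and helper name are assumptions):

```python
import logging
import requests

def maybe_resume(task_id: str, prev_run_id: int, script_args: list[str]) -> list[str]:
    # Illustrative sketch: only pass --pretrained-model when the previous
    # run's checkpoint can actually be downloaded.
    url = (
        "https://firefox-ci-tc.services.mozilla.com/api/queue/v1"
        f"/task/{task_id}/runs/{prev_run_id}/artifacts/public/build/model.npz"
    )
    logging.info("Fetching public/build/model.npz...")
    response = requests.get(url, stream=True)
    if response.status_code >= 400:
        # This is the "Got 4xx error ... run 1 is not resumable" path from the
        # log: fall back to training from scratch instead of resuming.
        logging.error("Got 4xx error for public/build/model.npz, run is not resumable")
        return script_args
    with open("model.npz", "wb") as checkpoint:
        for chunk in response.iter_content(chunk_size=1 << 20):
            checkpoint.write(chunk)
    # Only when the checkpoint is present do we ask train.py to continue from it.
    return [*script_args, "--pretrained-model", "model.npz"]
```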

The full traceback that caused the task to fail is:

[task 2024-12-17T16:38:34.308Z] [tracking ERROR] Publication failed! The error is ignored to not break training, but it should be fixed.
[task 2024-12-17T16:38:34.308Z] Traceback (most recent call last):
[task 2024-12-17T16:38:34.308Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/translations_parser/parser.py", line 332, in parse_marian_context
[task 2024-12-17T16:38:34.308Z]     headers, text = next(logs_iter)
[task 2024-12-17T16:38:34.308Z] StopIteration
[task 2024-12-17T16:38:34.308Z] 
[task 2024-12-17T16:38:34.308Z] During handling of the above exception, another exception occurred:
[task 2024-12-17T16:38:34.308Z] 
[task 2024-12-17T16:38:34.308Z] Traceback (most recent call last):
[task 2024-12-17T16:38:34.308Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/translations_parser/cli/taskcluster.py", line 174, in main
[task 2024-12-17T16:38:34.308Z]     boot()
[task 2024-12-17T16:38:34.308Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/translations_parser/cli/taskcluster.py", line 165, in boot
[task 2024-12-17T16:38:34.308Z]     parser.run()
[task 2024-12-17T16:38:34.308Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/translations_parser/parser.py", line 449, in run
[task 2024-12-17T16:38:34.308Z]     self.parse()
[task 2024-12-17T16:38:34.308Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/translations_parser/parser.py", line 412, in parse
[task 2024-12-17T16:38:34.308Z]     self.parse_marian_context(copy)
[task 2024-12-17T16:38:34.308Z]   File "/home/ubuntu/.local/lib/python3.10/site-packages/translations_parser/parser.py", line 335, in parse_marian_context
[task 2024-12-17T16:38:34.308Z]     raise ValueError("Could not find a [marian] entry in the training log.")
[task 2024-12-17T16:38:34.308Z] ValueError: Could not find a [marian] entry in the training log.
[task 2024-12-17T16:38:36.270Z] Traceback (most recent call last):
[task 2024-12-17T16:38:36.270Z]   File "/home/ubuntu/tasks/task_173445166741057/checkouts/vcs/pipeline/train/train.py", line 475, in <module>
[task 2024-12-17T16:38:36.270Z]     main()
[task 2024-12-17T16:38:36.270Z]   File "/home/ubuntu/tasks/task_173445166741057/checkouts/vcs/pipeline/train/train.py", line 471, in main
[task 2024-12-17T16:38:36.270Z]     train_cli.run_training()
[task 2024-12-17T16:38:36.270Z]   File "/home/ubuntu/tasks/task_173445166741057/checkouts/vcs/pipeline/train/train.py", line 375, in run_training
[task 2024-12-17T16:38:36.271Z]     shutil.copy(
[task 2024-12-17T16:38:36.271Z]   File "/usr/lib/python3.10/shutil.py", line 417, in copy
[task 2024-12-17T16:38:36.297Z]     copyfile(src, dst, follow_symlinks=follow_symlinks)
[task 2024-12-17T16:38:36.297Z]   File "/usr/lib/python3.10/shutil.py", line 254, in copyfile
[task 2024-12-17T16:38:36.297Z]     with open(src, 'rb') as fsrc:
[task 2024-12-17T16:38:36.297Z] FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/tasks/task_173445166741057/artifacts/model.npz.best-chrf.npz'
[task 2024-12-17T16:38:36.388Z] Traceback (most recent call last):
[task 2024-12-17T16:38:36.388Z]   File "/home/ubuntu/tasks/task_173445166741057/./checkouts/vcs/taskcluster/scripts/pipeline/train_taskcluster.py", line 137, in <module>
[task 2024-12-17T16:38:36.388Z]     main(sys.argv[1:])
[task 2024-12-17T16:38:36.388Z]   File "/home/ubuntu/tasks/task_173445166741057/./checkouts/vcs/taskcluster/scripts/pipeline/train_taskcluster.py", line 133, in main
[task 2024-12-17T16:38:36.388Z]     subprocess.run([TRAINING_SCRIPT, *script_args], check=True)
[task 2024-12-17T16:38:36.388Z]   File "/usr/lib/python3.10/subprocess.py", line 526, in run
[task 2024-12-17T16:38:36.422Z]     raise CalledProcessError(retcode, process.args,
[task 2024-12-17T16:38:36.422Z] subprocess.CalledProcessError: Command '['/home/ubuntu/tasks/task_173445166741057/./checkouts/vcs/taskcluster/scripts/pipeline/train-taskcluster.sh', 'teacher', 'train', 'zh', 'en', '/home/ubuntu/tasks/task_173445166741057/fetches/corpus.tok-icu,/home/ubuntu/tasks/task_173445166741057/fetches/mono.tok-icu', '/home/ubuntu/tasks/task_173445166741057/fetches/devset', '/home/ubuntu/tasks/task_173445166741057/artifacts', 'chrf', '/home/ubuntu/tasks/task_173445166741057/fetches/corpus.aln.zst,/home/ubuntu/tasks/task_173445166741057/fetches/mono.aln.zst', '1', 'two-stage', 'None', 'None', 'None', '--early-stopping', '20']' returned non-zero exit status 1.

The first half is about W&B reporting, and appears not to be fatal based on the "The error is ignored..." message. The second traceback (that starts on line 475 of train.py) appears to be the fatal part. Specifically:

[task 2024-12-17T16:38:36.297Z] FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/tasks/task_173445166741057/artifacts/model.npz.best-chrf.npz'

That comes from this piece of code, which looks like it is supposed to run after opustrainer and marian have run.
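
In other words, the fatal step is roughly the following (a simplified sketch of the copy in run_training; the function name and paths are illustrative):

```python
import shutil

def export_best_checkpoint(model_dir: str, artifacts_dir: str) -> None:
    # Sketch only: Marian writes model.npz.best-chrf.npz once validation has
    # produced a best checkpoint. In run 1, OpusTrainer failed to initialize,
    # so Marian never trained, the file was never written, and this copy
    # raises FileNotFoundError.
    shutil.copy(
        f"{model_dir}/model.npz.best-chrf.npz",
        f"{artifacts_dir}/model.npz.best-chrf.npz",
    )
```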

Specific questions

To reply to your specific questions as well:

1. It couldn't download the checkpoint:
[task 2024-12-17T16:14:21.902Z] INFO:root:Fetching public/build/model.npz...
[task 2024-12-17T16:14:22.133Z] ERROR:root:Got 4xx error for public/build/model.npz, run 1 is not resumable...

Covered above.

2. It appears not to have installed the Python packages:
[task 2024-12-17T16:14:15.347Z] Requirement already satisfied: sentencepiece==0.1.99
...

How did it get those already installed? For example, in CI we can see the logs as they are being installed:

[task 2024-11-22T23:43:12.368Z] Collecting sentencepiece==0.1.99 (from -r ./checkouts/vcs/pipeline/train/requirements/train.txt (line 19))
[task 2024-11-22T23:43:12.385Z]   Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
[task 2024-11-22T23:43:22.363Z] Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[task 2024-11-22T23:43:23.925Z] Installing collected packages: typo, sentencepiece, tqdm, regex, pyyaml, joblib, click, sacremoses, opustrainer

This just means the package was already there; this worker previously ran tasks, so it had things like this cached already:
[screenshot of the cached packages on the worker]

3. The initialization of OpusTrainer failed. This was not the case for the first attempt: the model was training fine, as we can see in train.log. Because of this, the whole task failed.
[task 2024-12-17T16:38:34.130Z] opustrainer.trainer.CurriculumLoaderError: could not initialize modifier 'Tags': PlaceholderTagModifier.__init__() got an unexpected keyword argument 'tag'
4. Some errors related to parsing, likely due to the previous error:
[tracking ERROR] Publication failed! The error is ignored to not break training, but it should be fixed.

I'm a bit suspicious that W&B exiting is causing the entire pipeline to fail. This seems to be run through run_command_pipeline; it's not immediately clear to me whether the W&B parser exiting will cause this to return.
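
If run_command_pipeline wires the stages together with OS pipes, the failure mode I'm suspicious of would look roughly like this (purely illustrative, not the actual helper):

```python
import subprocess

def run_command_pipeline_sketch(commands: list[list[str]]) -> int:
    # Illustrative only: chain commands like `opustrainer ... | marian ... | parser`
    # in a shell. If the final consumer (the W&B parser) exits early, the
    # upstream stages hit SIGPIPE/EPIPE on their next write, so the whole
    # pipeline can come down even though the parser "ignored" its own error.
    procs = []
    prev_stdout = None
    for cmd in commands:
        proc = subprocess.Popen(cmd, stdin=prev_stdout, stdout=subprocess.PIPE)
        if prev_stdout is not None:
            prev_stdout.close()  # allow SIGPIPE to reach the producer
        prev_stdout = proc.stdout
        procs.append(proc)
    procs[-1].stdout.read()  # drain the last stage
    returncodes = [proc.wait() for proc in procs]
    return next((rc for rc in returncodes if rc != 0), 0)
```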

5. We can't see the [Taskcluster log](https://firefox-ci-tc.services.mozilla.com/tasks/Tiqin_UwSJSd8YDULbaV0g/runs/0/logs/public/logs/live.log) for the first attempt. Should we reopen the issue?

Covered above.

@eu9ene
Collaborator Author

eu9ene commented Dec 18, 2024

Thank you for clarifying things! The mystery for me here is still the OpusTrainer error. Do you see the same error in the log for run 0? If not, then something really is messing up the package versions. I wonder if your new Docker images will solve this.

@bhearsum
Collaborator

Thank you for clarifying things! The mystery for me here is still the OpusTrainer error. Do you see the same error in the log for run 0? If not, then something really is messing up the package versions. I wonder if your new Docker images will solve this.

Unfortunately, the log from run 0 is unrecoverable :(. The things I pasted above were from syslog, which is preserved.

@bhearsum
Collaborator

We dug more into this on Matrix. What's going on here is that the task that failed used an opustrainer wheel from a previous task, which was built from a different revision:

[task 2024-12-17T15:49:32.122Z] Collecting opustrainer@ git+https://github.com/hplt-project/OpusTrainer.git@c966d7b353d6b3c6a09d9573f1ab6ba3221c1d21 (from -r ./checkouts/vcs/pipeline/train/requirements/train.txt (line 11))
[task 2024-12-17T15:49:50.039Z] Created wheel for opustrainer: filename=opustrainer-0.2-py3-none-any.whl size=43619 sha256=afeb9bd0ca2464da9e9cc80521e3f43bf6ae843c86f7235e3ab0ebf2911e35f1

(From https://firefox-ci-tc.services.mozilla.com/tasks/IKcqat0UQ6yVUBYJBDDbcA)

This happens because we're pinning opustrainer to a revision, but the wheel gets built with only a version number in its filename. When the subsequent task ran, it saw an opustrainer 0.2 wheel and simply used that rather than rebuilding.

This sucks, and sounds like a pip bug.

One thing this does mean is that spot termination was not a factor here.

A few suggestions as to how to proceed:

  • Bump the opustrainer version when you make changes to it (see the sketch below)
  • See if pip has a way to include the rev in the wheel name
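
To illustrate the first suggestion: from pip's point of view, the stale wheel and a fresh build from the currently pinned revision are both just "opustrainer 0.2", so nothing in the version metadata distinguishes them. A quick illustrative check:

```python
from importlib import metadata

# Both the stale wheel (built from the old revision) and a fresh build from the
# pinned revision report the same version, so pip sees no difference between them:
print(metadata.version("opustrainer"))  # -> "0.2" in both cases

# Bumping the version in OpusTrainer's packaging metadata changes the wheel
# filename (e.g. opustrainer-0.3-...), which is what actually forces a rebuild.
```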
