Error on resuming training after preemption #964
I'm going to separate this slightly. First, a bit on the termination of run 0, and then about run 1 not resuming properly. As a quick tl;dr - we definitely did not try to resume in run 1.

Run 0

I was able to dig up worker logs for run 0. Most notably we have:
So, this was most definitely a spot termination. And it looks to me like we didn't have time to upload all of the artifacts before we got the hard poweroff. (Otherwise the log and other artifacts would be available.) There may be two things influencing this:
Run 1

As far as I can tell, run 1 did not try to resume, which is what I would expect in this situation. If it had, we would expect to see:

The full traceback that caused the task to fail is:
The first half is about W&B reporting, and appears not to be fatal based on the "The error is ignored..." message. The second traceback (that starts on line 475 of train.py) appears to be the fatal part. Specifically:
That comes from this piece of code, which looks like it is supposed to run after opustrainer and marian have run.

Specific questions

To reply to your specific questions as well:
Covered above
This just means the package was already there; this worker previously ran tasks, so it had things like this cached already:
I'm a bit suspicious that W&B exiting is causing the entire pipeline to fail. This seems to be run through run_command_pipeline; it's not immediately clear to me whether the W&B parser exiting will cause this to return or not (see the sketch after this list).
Covered above.
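To make that suspicion concrete, here is a minimal sketch of a shell-style pipeline runner, assuming behaviour roughly like chained subprocess pipes; it is not the project's actual run_command_pipeline. If the downstream parser exits early, the producer can hit a broken pipe the next time it writes, and a caller that checks every return code will then treat the whole chain as failed.

```python
import subprocess

def run_pipeline(*commands: list[str]) -> None:
    """Run commands chained stdout -> stdin and fail if any of them fails."""
    procs = []
    prev_stdout = None
    for i, cmd in enumerate(commands):
        is_last = i == len(commands) - 1
        proc = subprocess.Popen(
            cmd,
            stdin=prev_stdout,
            stdout=None if is_last else subprocess.PIPE,
        )
        if prev_stdout is not None:
            # Close our copy so the producer sees a broken pipe if the consumer dies.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)
    for proc in procs:
        proc.wait()
    failures = [p.args for p in procs if p.returncode != 0]
    if failures:
        raise RuntimeError(f"pipeline failed: {failures}")

# Example: run_pipeline(["cat", "training.log"], ["head", "-n", "1"])
# `head` exits after one line; `cat` may then die from SIGPIPE (returncode -13),
# so the checker above reports the whole pipeline as failed even though the
# consumer finished "successfully".
```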
Thank you for clarifying things! The mystery for me here is still the OpusTrainer error. Do you see the same error in the log for run 0? If not, it means something messed up the package versions. I wonder if your new docker images will solve this.
Unfortunately, the log from run 0 is unrecoverable :(. The things I pasted above were from syslog, which is preserved.
We dug more into this on Matrix. What's going on here is that the task that failed used an opustrainer wheel from a previous task, which was built from a different revision:
(From https://firefox-ci-tc.services.mozilla.com/tasks/IKcqat0UQ6yVUBYJBDDbcA) This happens because we're pinning opustrainer to a revision, but the wheel is getting built with a version number in its filename. When the subsequent task ran, it saw an opustrainer 0.2 wheel, and simply used that rather than rebuilding. This sucks, and sounds like a pip bug. One thing this does mean is that spot termination was not a factor here. A few suggestions as to how to proceed:
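To illustrate the reuse described above, here is a minimal sketch of a filename-based wheel lookup, assuming the install step keys only on the distribution name and version rather than the pinned revision; it is an illustration of the failure mode, not the pipeline's actual install code.

```python
from pathlib import Path
from typing import Optional

def find_existing_wheel(wheel_dir: Path, name: str, version: str) -> Optional[Path]:
    """Return a previously built wheel whose filename matches name and version."""
    return next(wheel_dir.glob(f"{name}-{version}-*.whl"), None)

def get_wheel(wheel_dir: Path, name: str, version: str, revision: str) -> Path:
    cached = find_existing_wheel(wheel_dir, name, version)
    if cached is not None:
        # The pinned revision is never consulted here: a wheel built from an
        # older commit that still calls itself 0.2 satisfies the lookup, so
        # the new revision never gets built.
        return cached
    # Otherwise a fresh wheel would be built from `revision` (omitted here).
    raise NotImplementedError(f"build {name} {version} from revision {revision}")

# With opustrainer pinned to a new revision but still versioned 0.2, a stale
# opustrainer-0.2-*.whl left behind by a previous task wins the lookup above.
```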
https://firefox-ci-tc.services.mozilla.com/tasks/Tiqin_UwSJSd8YDULbaV0g/runs/1/logs/public/logs/live.log
There are a bunch of things that happened here:
How did it get those already installed? For example, in CI we can see the logs when they are being installed:
@bhearsum I need your help to figure out the situation with packages. I assume it somehow pulled the wrong package from the cache. Then OpusTrainer used the old version and couldn't use my fix where I added the tag parameter.
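One way to surface which opustrainer a task actually ended up with is to log its install origin at startup. This is a sketch, assuming the distribution is installed under the name "opustrainer" and that pip recorded PEP 610 direct_url.json metadata at install time; the expected revision below is a hypothetical placeholder.

```python
import json
from importlib import metadata

EXPECTED_REVISION = "<pinned git revision>"  # hypothetical placeholder

def check_opustrainer_origin() -> None:
    dist = metadata.distribution("opustrainer")
    raw = dist.read_text("direct_url.json")
    if raw is None:
        # No direct_url.json, e.g. installed from a package index.
        print(f"opustrainer {dist.version}: no install-origin metadata recorded")
        return
    info = json.loads(raw)
    commit = info.get("vcs_info", {}).get("commit_id")
    print(f"opustrainer {dist.version} installed from {info.get('url')} (commit: {commit})")
    if commit is not None and commit != EXPECTED_REVISION:
        raise RuntimeError("installed opustrainer does not match the pinned revision")

check_opustrainer_origin()
```

A check like this would not fix the stale-wheel reuse, but it would make a version/revision mismatch fail loudly at the start of the task instead of surfacing as an OpusTrainer error mid-run.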