Summary
Currently, the code uses SIGTERM to kill all the child processes. This is the [ideal way to kill a process because it can be blocked or handled in various ways](https://komodor.com/learn/what-is-sigkill-signal-9-fast-termination-of-linux-containers/#:~:text=SIGKILL%20(also%20known%20as%20Unix,or%20handled%20in%20various%20ways.), so a process can clean up before exiting. However, an issue I've noticed with SIGTERM is that it can hang: the child processes are never killed and the code gets stuck waiting forever. In that situation, we'd want to use SIGKILL, which forces the process to terminate.

In this PR, I've added logic that waits some amount of time for SIGTERM to kill the process; once the timeout threshold is exceeded, it kills any remaining / lingering jobs with SIGKILL. I've also made the timeout threshold a CLI parameter so users can customize it.

Usage
SIGTERM

By default, the timeout threshold is 30 seconds. I chose this number because I ran GPU Burn several times with different durations and found that 30 seconds was about the average time the child processes took to exit gracefully with SIGTERM when no issues were present (i.e., nothing hung).

SIGKILL by specifying a very small timeout threshold

In this case, it will jump to SIGKILL immediately since the timeout threshold is 0 seconds.

SIGTERM by specifying a reasonable time window

In this case, it will jump to SIGKILL after 120 seconds (2 minutes), which is a reasonable upper bound for SIGTERM to do its job (based on my empirical evidence).

Test Plan
Run the above examples - works as expected.
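
For reviewers who want the escalation from the Summary in one place, here is a minimal sketch of the idea, not the code in this PR. It assumes the children are `subprocess.Popen` objects on a POSIX system; the function name `stop_children` and the parameter `timeout_s` are hypothetical and chosen only for illustration.

```python
import subprocess
import time


def stop_children(children, timeout_s=30.0):
    """Ask every child to exit with SIGTERM, then SIGKILL the stragglers.

    `children` is a list of subprocess.Popen objects (an assumption of this
    sketch). Returns the PIDs that had to be force-killed.
    """
    # Step 1: polite request. On POSIX, Popen.terminate() sends SIGTERM.
    for child in children:
        if child.poll() is None:  # still running
            child.terminate()

    # Step 2: all children share one deadline, so the total wait is bounded
    # by timeout_s rather than timeout_s per child.
    deadline = time.monotonic() + timeout_s
    force_killed = []
    for child in children:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            child.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            # Step 3: the child ignored or blocked SIGTERM. On POSIX,
            # Popen.kill() sends SIGKILL, which cannot be caught.
            child.kill()
            child.wait()  # reap the process so it does not linger as a zombie
            force_killed.append(child.pid)
    return force_killed
```

With `timeout_s=0`, any child that has not already exited is force-killed right away, which matches the "very small timeout threshold" case above; a larger value such as 120 gives SIGTERM more time before escalating.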