Timeout Threshold for SIGTERM #78

Merged
merged 3 commits into from
Apr 5, 2023

nahmed3536
Contributor

Summary

Currently, the code uses SIGTERM to kill all the child processes. This is the [preferred way to stop a process because, unlike SIGKILL, it can be blocked or handled in various ways](https://komodor.com/learn/what-is-sigkill-signal-9-fast-termination-of-linux-containers/#:~:text=SIGKILL%20(also%20known%20as%20Unix,or%20handled%20in%20various%20ways.), which gives the process a chance to exit gracefully. However, an issue I've noticed with SIGTERM is that it can hang: the child processes are never killed and the code gets stuck waiting forever. In that situation, we want to fall back to SIGKILL, which forces the process to terminate.

In this PR, I've added logic that waits a configurable amount of time for SIGTERM to stop the processes; once the timeout threshold is exceeded, it kills any remaining jobs with SIGKILL. I've also exposed the timeout threshold as a CLI parameter so users can customize it.
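
The escalation described above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the helper name `terminate_children`, the 100 ms polling interval, and the return value are assumptions made for the example. It relies only on the POSIX calls `kill` and `waitpid`.

```cpp
#include <cassert>
#include <chrono>
#include <csignal>
#include <thread>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper, not taken from gpu_burn: asks every child to exit with
// SIGTERM, waits up to timeout_seconds, then sends SIGKILL to any survivors.
// Returns the number of children that had to be stopped with SIGKILL.
int terminate_children(const std::vector<pid_t>& children, int timeout_seconds) {
    assert(timeout_seconds >= 0);
    std::vector<pid_t> alive(children);

    // Step 1: polite request; a child may handle, block, or ignore SIGTERM.
    for (pid_t pid : alive) {
        kill(pid, SIGTERM);
    }

    // Step 2: poll without blocking until every child exits or time runs out.
    const auto deadline = std::chrono::steady_clock::now() +
                          std::chrono::seconds(timeout_seconds);
    while (true) {
        for (auto it = alive.begin(); it != alive.end();) {
            // With WNOHANG, waitpid returns 0 while the child is still running.
            if (waitpid(*it, nullptr, WNOHANG) != 0) {
                it = alive.erase(it);  // reaped, or no longer our child
            } else {
                ++it;
            }
        }
        if (alive.empty() || std::chrono::steady_clock::now() >= deadline) {
            break;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    // Step 3: SIGKILL cannot be caught or ignored, so this blocking wait returns.
    for (pid_t pid : alive) {
        kill(pid, SIGKILL);
        waitpid(pid, nullptr, 0);
    }
    return static_cast<int>(alive.size());
}
```

With a timeout of 0 the loop makes a single pass and then escalates immediately, which matches the "jump to SIGKILL" behavior described under Usage.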

Usage

  1. Run GPU Burn as is, without specifying a timeout threshold for SIGTERM
./gpu_burn

By default, the timeout threshold is 30 seconds. I selected this number because I ran GPU Burn several times with different durations and found that 30 seconds was about the average time it took for the child processes to exit gracefully with SIGTERM when no issues were present (i.e., nothing hung).

  2. Use SIGKILL exclusively by specifying a timeout threshold of 0
./gpu_burn -stts 0

In this case, it will jump to SIGKILL immediately since the timeout threshold is 0 seconds.

  3. Rely on SIGTERM in practice by specifying a generous timeout threshold
./gpu_burn -stts 120

In this case, it will fall back to SIGKILL only after 120 seconds (2 minutes), which is a reasonable upper bound for SIGTERM to do its job (based on my empirical evidence).
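
For illustration, the `-stts` option and its 30-second default could be read along these lines. This is only a sketch under assumptions: the function name `parse_sigterm_timeout` and the choice to fall back to the default on a malformed value are hypothetical, and gpu_burn's actual argument handling may differ.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>
#include <cstring>

// Default taken from the PR description above.
constexpr int kDefaultSigtermTimeoutSeconds = 30;

// Hypothetical parser: returns the value following "-stts", or the default
// when the flag is absent or its value is not a non-negative integer.
int parse_sigterm_timeout(int argc, const char* const* argv) {
    assert(argv != nullptr);
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::strcmp(argv[i], "-stts") == 0) {
            char* end = nullptr;
            const long value = std::strtol(argv[i + 1], &end, 10);
            // Accept only a complete number that fits in a non-negative int.
            if (end != argv[i + 1] && *end == '\0' &&
                value >= 0 && value <= INT_MAX) {
                return static_cast<int>(value);
            }
        }
    }
    return kDefaultSigtermTimeoutSeconds;
}
```

Under these assumptions, `./gpu_burn` yields 30, `./gpu_burn -stts 0` yields 0, and `./gpu_burn -stts 120` yields 120, mirroring the three usage cases above.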

Test Plan

Run the above examples - works as expected.

Owner

wilicc left a comment

Yeah why not. It's tricky to kill ongoing GPU work but shouldn't break anything.

wilicc merged commit 327ef88 into wilicc:master on Apr 5, 2023