Fix deadlock with parallel SBY procs each with parallel tasks
When multiple SBY processes run in parallel (from a Makefile or another
job-server aware tool) and each SBY process runs its tasks in parallel,
with enough tasks per process to be limited by the total job count, the
processes can race in such a way that every SBY process's helper process
ends up in a blocking read from the job-server, while a job-token would
only become available once one of the SBY processes exits.
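
For context, the job-server protocol underneath all of this amounts to
passing single-byte tokens through a shared pipe, plus one implicit
token that each process holds by virtue of having been started by its
parent. A minimal sketch (the fd names are illustrative; in practice
they are inherited via MAKEFLAGS):

    import os

    def acquire_token(jobserver_read_fd):
        # One byte in the shared pipe represents one job slot; this read
        # blocks until some process returns a token.
        return os.read(jobserver_read_fd, 1)

    def release_token(jobserver_write_fd, token):
        # Writing the byte back makes the slot available again to all
        # processes sharing the pipe.
        os.write(jobserver_write_fd, token)

The implicit parent-held token is why a stuck process keeps a job slot
occupied until it exits.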

In that situation SBY doesn't actually need the job-token anymore; it
only requested one earlier because there was an opportunity for
parallelism, and it would return the token immediately upon acquiring
it. That is usually sufficient to deal with tokens that were requested
but are no longer needed. When SBY is done, however, it must return the
job-token held by its parent process as soon as possible, which it can
only do by actually exiting, so we need a way to interrupt the blocking
read in SBY's helper process.

This could be done by sending a signal to the helper process, except
that Python decided in 3.5 to wrap most system calls in automatic EINTR
retry loops, with no opt-out. That was part of the reason for choosing
this specific helper-process design, which avoids interrupting a
blocking read in the first place.
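
A self-contained illustration of that retry behavior (PEP 475): the
signal interrupts the read mid-call, the handler runs, and the call is
transparently restarted instead of failing with EINTR:

    import os, signal, threading

    signal.signal(signal.SIGUSR1, lambda *args: print("handler ran"))
    r, w = os.pipe()
    main = threading.main_thread().ident
    threading.Timer(0.5, signal.pthread_kill, (main, signal.SIGUSR1)).start()
    threading.Timer(1.0, os.write, (w, b"T")).start()
    # The handler runs mid-read, but os.read() is retried internally and
    # only returns once actual data arrives.
    print(os.read(r, 1))  # prints b'T'; no InterruptedError is raised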

Raising an exception from the signal handler instead could lose a token
when the signal arrives after the read returns but before the token is
stored in a variable. A lost token cannot be recovered within the
job-server protocol, so that's not an option. (This particular race
can't happen with recent Python versions, but relying on that would mean
depending on undocumented behavior that could plausibly change again.)
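
Sketched as hypothetical code (not what this commit does), the
problematic variant would look roughly like this:

    import os, signal

    def interrupt_by_exception(*args):
        raise InterruptedError  # raising out of the handler aborts the read

    signal.signal(signal.SIGUSR1, interrupt_by_exception)
    jobserver_read_fd, w = os.pipe()  # stand-in for the real jobserver pipe
    os.write(w, b"T")  # one available token

    try:
        token = os.read(jobserver_read_fd, 1)
        # Race window: the exception can be delivered after os.read() has
        # consumed the byte but before it is bound to `token`; the stack
        # unwinds and the token is lost for good.
    except InterruptedError:
        token = None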

Thankfully the only case where we need to interrupt the read is when SBY
is about to exit and will not request any further tokens. This allows us
to use a signal handler that calls dup2 to close the read-from fd and
replace it with one that is already at EOF, making the next retry of the
read return immediately. (If we needed to interrupt a read and then
continue running, we could still use this approach, but the fd shuffling
would be more involved.)
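
As a standalone sketch of the trick (the actual change below applies it
to the helper's jobserver fd):

    import os, signal, threading

    r, w = os.pipe()  # nothing ever writes to w, so reading r blocks forever

    def handle_sigusr1(*args):
        # Swap in an fd that is already at EOF; the automatically retried
        # read then returns b"" immediately instead of blocking again.
        eof_r, eof_w = os.pipe()
        os.close(eof_w)
        os.dup2(eof_r, r)
        os.close(eof_r)

    signal.signal(signal.SIGUSR1, handle_sigusr1)
    main = threading.main_thread().ident
    threading.Timer(1.0, signal.pthread_kill, (main, signal.SIGUSR1)).start()
    assert os.read(r, 1) == b""  # blocks, is interrupted, retries at EOF
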
jix committed Jul 17, 2023
1 parent fbbbab2 commit edbc054
Showing 1 changed file with 15 additions and 0 deletions.
sbysrc/sby_jobserver.py (15 additions, 0 deletions)
@@ -80,7 +80,16 @@ def process_jobserver_environment():
 
 def jobserver_helper(jobserver_read_fd, jobserver_write_fd, request_fd, response_fd):
     """Helper process to handle blocking jobserver pipes."""
+    def handle_sigusr1(*args):
+        # Since Python doesn't allow user code to handle EINTR anymore, we replace the
+        # jobserver fd with an fd at EOF to interrupt a blocking read in a way that
+        # cannot lose any read data
+        r, w = os.pipe()
+        os.close(w)
+        os.dup2(r, jobserver_read_fd)
+        os.close(r)
     signal.signal(signal.SIGINT, signal.SIG_IGN)
+    signal.signal(signal.SIGUSR1, handle_sigusr1)
     pending = 0
     while True:
         try:
@@ -110,6 +119,8 @@ def jobserver_helper(jobserver_read_fd, jobserver_write_fd, request_fd, response_fd):
             except BlockingIOError:
                 select.select([jobserver_read_fd], [], [])
                 continue
+            if not token:
+                break
 
             pending -= 1
 
@@ -240,6 +251,10 @@ def atexit_blocking(self):
         # Closing the request pipe signals the helper that we want to exit
         os.close(self.request_write_fd)
 
+        # Additionally we send a signal to interrupt a blocking read within the
+        # helper
+        self.helper_process.send_signal(signal.SIGUSR1)
+
         # The helper might have been in the process of sending us some tokens, which
         # we still need to return
         while True: