Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in free-threaded Python (fixes issue #867) #887

Merged
merged 2 commits into from
Jan 28, 2025
Merged

Conversation

wjakob
Copy link
Owner

@wjakob wjakob commented Jan 28, 2025

This commit addresses an issue arising when multiple threads want to access the Python object associated with the same C++ instance, which does not exist yet and therefore must be created. @vfdev-5 reported that TSAN detects a race condition in code that uses this pattern, caused by concurrent unprotected reads/writes of internal nb_inst fields.

To fix this issue, we split instance creation and registration into a two-step process. The registration is only done when the object is fully constructed.

wjakob and others added 2 commits January 28, 2025 21:53
This commit addresses an issue arising when multiple threads want to
access the Python object associated with the same C++ instance, which
does not exist yet and therefore must be created. @vfdev-5 reported that
TSAN detects a race condition in code that uses this pattern, caused by
concurrent unprotected reads/writes of internal ``nb_inst`` fields.

To fix this issue, we split instance creation and registration into
a two-step process. The latter is only done when the object is fully
constructed.
@wjakob
Copy link
Owner Author

wjakob commented Jan 28, 2025

This commit fixes the race condition in free-threaded Python.

cc @vfdev-5 @hawkinsp @oremanj

I punted on the idea of strictly enforcing an invariant that a (C++ instance pointer, type) pair can map to at most one Python object. See branch ft-race-alt / commit 1415e05 for a failed (?) attempt of such an implementation. In that alternative implementation, I check the instance list for duplicates while registering a new instance. If a duplicate exists, I return that one instead and destroy the unregistered instance. This passes the tests but makes me worry that it's going to be difficult to properly enforce the return value policies.

For example, what if some threads return an object via rv_policy::reference, and then one single thread takes ownership via rv_policy::take_ownership. If we potentially return some other thread's instance, we would have to "patch in" the update the flags, which sounds complicated. (This is kind of a silly example because it is not handled correctly at the moment even without this patch. But it seems fixable if the creation of duplicate instances is allowed)

@wjakob wjakob merged commit d793091 into master Jan 28, 2025
31 checks passed
@wjakob wjakob deleted the ft-race branch January 28, 2025 13:26
@vfdev-5
Copy link
Contributor

vfdev-5 commented Jan 28, 2025

@wjakob thanks a lot for fixing the race!

I have a quick question about the fix for my better understanding of the internals.

To fix this issue, we split instance creation and registration into a two-step process. The registration is only done when the object is fully constructed.

The above mentioned registration is done when object is fully constructed for the case when the object was created by inst_new_ext method.

The method inst_new_int also does a simple registration (ie shard.inst_c2p.try_emplace) once the object is constructed. Wont we have similar situation where we should also perform the registration of the objects built with inst_new_int later like for those built with inst_new_ext ?

@wjakob
Copy link
Owner Author

wjakob commented Jan 28, 2025

In inst_new_int, that whole issue cannot appear because we're registering a new instance that nobody has seen yet. So there cannot be a race on the pointer identity of this instance.

@makslevental
Copy link

@wjakob "peanut gallery"-level question: can we get a PATCH version bump for this change so that we can require it downstream (in MLIR).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants