forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'futexes-for-linus' of git://git.kernel.org/pub/scm/linu…
…x/kernel/git/tip/linux-2.6-tip * 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: fix restart in wait_requeue_pi futex: fix restart for early wakeup in futex_wait_requeue_pi() futex: cleanup error exit futex: remove the wait queue futex: add requeue-pi documentation futex: remove FUTEX_REQUEUE_PI (non CMP) futex: fix futex_wait_setup key handling sparc64: extend TI_RESTART_BLOCK space by 8 bytes futex: fixup unlocked requeue pi case futex: add requeue_pi functionality futex: split out futex value validation code futex: distangle futex_requeue() futex: add FUTEX_HAS_TIMEOUT flag to restart.futex.flags rt_mutex: add proxy lock routines futex: split out fixup owner logic from futex_lock_pi() futex: split out atomic logic from futex_lock_pi() futex: add helper to find the top prio waiter of a futex futex: separate futex_wait_queue_me() logic from futex_wait()
- Loading branch information
Showing
7 changed files
with
1,228 additions
and
358 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,131 @@ | ||
Futex Requeue PI | ||
---------------- | ||
|
||
Requeueing of tasks from a non-PI futex to a PI futex requires | ||
special handling in order to ensure the underlying rt_mutex is never | ||
left without an owner if it has waiters; doing so would break the PI | ||
boosting logic [see rt-mutex-desgin.txt] For the purposes of | ||
brevity, this action will be referred to as "requeue_pi" throughout | ||
this document. Priority inheritance is abbreviated throughout as | ||
"PI". | ||
|
||
Motivation | ||
---------- | ||
|
||
Without requeue_pi, the glibc implementation of | ||
pthread_cond_broadcast() must resort to waking all the tasks waiting | ||
on a pthread_condvar and letting them try to sort out which task | ||
gets to run first in classic thundering-herd formation. An ideal | ||
implementation would wake the highest-priority waiter, and leave the | ||
rest to the natural wakeup inherent in unlocking the mutex | ||
associated with the condvar. | ||
|
||
Consider the simplified glibc calls: | ||
|
||
/* caller must lock mutex */ | ||
pthread_cond_wait(cond, mutex) | ||
{ | ||
lock(cond->__data.__lock); | ||
unlock(mutex); | ||
do { | ||
unlock(cond->__data.__lock); | ||
futex_wait(cond->__data.__futex); | ||
lock(cond->__data.__lock); | ||
} while(...) | ||
unlock(cond->__data.__lock); | ||
lock(mutex); | ||
} | ||
|
||
pthread_cond_broadcast(cond) | ||
{ | ||
lock(cond->__data.__lock); | ||
unlock(cond->__data.__lock); | ||
futex_requeue(cond->data.__futex, cond->mutex); | ||
} | ||
|
||
Once pthread_cond_broadcast() requeues the tasks, the cond->mutex | ||
has waiters. Note that pthread_cond_wait() attempts to lock the | ||
mutex only after it has returned to user space. This will leave the | ||
underlying rt_mutex with waiters, and no owner, breaking the | ||
previously mentioned PI-boosting algorithms. | ||
|
||
In order to support PI-aware pthread_condvar's, the kernel needs to | ||
be able to requeue tasks to PI futexes. This support implies that | ||
upon a successful futex_wait system call, the caller would return to | ||
user space already holding the PI futex. The glibc implementation | ||
would be modified as follows: | ||
|
||
|
||
/* caller must lock mutex */ | ||
pthread_cond_wait_pi(cond, mutex) | ||
{ | ||
lock(cond->__data.__lock); | ||
unlock(mutex); | ||
do { | ||
unlock(cond->__data.__lock); | ||
futex_wait_requeue_pi(cond->__data.__futex); | ||
lock(cond->__data.__lock); | ||
} while(...) | ||
unlock(cond->__data.__lock); | ||
/* the kernel acquired the the mutex for us */ | ||
} | ||
|
||
pthread_cond_broadcast_pi(cond) | ||
{ | ||
lock(cond->__data.__lock); | ||
unlock(cond->__data.__lock); | ||
futex_requeue_pi(cond->data.__futex, cond->mutex); | ||
} | ||
|
||
The actual glibc implementation will likely test for PI and make the | ||
necessary changes inside the existing calls rather than creating new | ||
calls for the PI cases. Similar changes are needed for | ||
pthread_cond_timedwait() and pthread_cond_signal(). | ||
|
||
Implementation | ||
-------------- | ||
|
||
In order to ensure the rt_mutex has an owner if it has waiters, it | ||
is necessary for both the requeue code, as well as the waiting code, | ||
to be able to acquire the rt_mutex before returning to user space. | ||
The requeue code cannot simply wake the waiter and leave it to | ||
acquire the rt_mutex as it would open a race window between the | ||
requeue call returning to user space and the waiter waking and | ||
starting to run. This is especially true in the uncontended case. | ||
|
||
The solution involves two new rt_mutex helper routines, | ||
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which | ||
allow the requeue code to acquire an uncontended rt_mutex on behalf | ||
of the waiter and to enqueue the waiter on a contended rt_mutex. | ||
Two new system calls provide the kernel<->user interface to | ||
requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_CMP_PI. | ||
|
||
FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() | ||
and pthread_cond_timedwait()) to block on the initial futex and wait | ||
to be requeued to a PI-aware futex. The implementation is the | ||
result of a high-speed collision between futex_wait() and | ||
futex_lock_pi(), with some extra logic to check for the additional | ||
wake-up scenarios. | ||
|
||
FUTEX_REQUEUE_CMP_PI is called by the waker | ||
(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and | ||
possibly wake the waiting tasks. Internally, this system call is | ||
still handled by futex_requeue (by passing requeue_pi=1). Before | ||
requeueing, futex_requeue() attempts to acquire the requeue target | ||
PI futex on behalf of the top waiter. If it can, this waiter is | ||
woken. futex_requeue() then proceeds to requeue the remaining | ||
nr_wake+nr_requeue tasks to the PI futex, calling | ||
rt_mutex_start_proxy_lock() prior to each requeue to prepare the | ||
task as a waiter on the underlying rt_mutex. It is possible that | ||
the lock can be acquired at this stage as well, if so, the next | ||
waiter is woken to finish the acquisition of the lock. | ||
|
||
FUTEX_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but | ||
their sum is all that really matters. futex_requeue() will wake or | ||
requeue up to nr_wake + nr_requeue tasks. It will wake only as many | ||
tasks as it can acquire the lock for, which in the majority of cases | ||
should be 0 as good programming practice dictates that the caller of | ||
either pthread_cond_broadcast() or pthread_cond_signal() acquire the | ||
mutex prior to making the call. FUTEX_REQUEUE_PI requires that | ||
nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for | ||
signal. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.