Improve performance of exec_wait for multi-threaded use (Xilinx#7123)
* Improve performance of exec_wait for multi-threaded use

The first host thread invokes xrt_core::device::exec_wait and
notifies subsequent threads upon completion of each exec_wait call.

Performance testing of multiple threads running simple [run, wait]
loops on the same kernel run object showed that a multi-threaded
application with n threads performed noticeably worse than n
single-threaded processes.  The problem was narrowed down to how a
mutex lock used for mutual exclusion around exec_wait interacted
with the waiting threads.  The waiting code treated the mutex lock
more or less as a queue, when in fact a mutex does not behave as
such, and some threads were starved, waiting in vain.

In order to reduce multi-threaded wait time, a condition variable
wait is used for subsequent threads calling exec_wait while some
other thread is busy running xrt_core::device::exec_wait.  Condition
variable wait and wake-up are faster than letting multiple threads
wait on a single mutex lock.  This means that device::exec_wait is
done by the first host thread that needs the call to exec_wait,
while condition variable notification is used for additional threads.

Signed-off-by: Soren Soe <[email protected]>

* notify outside critical section and add exit condition for wait

Signed-off-by: Soren Soe <[email protected]>
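
For illustration, here is a minimal, self-contained sketch of the
pattern the message describes.  The names (exec_wait_gate,
m_call_count, m_active, do_exec_wait) are placeholders rather than
the names used in hw_queue.cpp, and timeout handling is omitted:

// Sketch only: the first thread runs the blocking call, later
// threads wait on a condition variable keyed to a generation count.
#include <condition_variable>
#include <cstdint>
#include <mutex>

class exec_wait_gate
{
  std::mutex m_mutex;
  std::condition_variable m_work;
  uint64_t m_call_count = 0;  // bumped once per completed wait call
  bool m_active = false;      // true while one thread is in the call

public:
  template <typename ExecWait>
  void wait(ExecWait&& do_exec_wait)
  {
    thread_local uint64_t seen = 0;
    std::unique_lock lk(m_mutex);

    if (seen != m_call_count) {
      // another thread's call may already cover this thread's commands
      seen = m_call_count;
      return;
    }

    if (m_active) {
      // another thread is in the blocking call; sleep until it bumps
      // the generation count instead of contending for the mutex
      m_work.wait(lk, [&] { return seen != m_call_count; });
      seen = m_call_count;
      return;
    }

    m_active = true;
    lk.unlock();         // run the blocking call outside the lock
    do_exec_wait();
    lk.lock();
    seen = ++m_call_count;
    m_active = false;
    lk.unlock();
    m_work.notify_all(); // notify outside the critical section
  }
};

Notifying outside the critical section means woken threads do not
immediately block again on a still-held mutex.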
stsoe authored Oct 27, 2022
1 parent d125fea commit 69d94d9
Showing 1 changed file with 81 additions and 17 deletions.
src/runtime_src/core/common/api/hw_queue.cpp
@@ -17,12 +17,18 @@
#include "core/include/xcl_hwqueue.h"

#include <algorithm>
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <map>
#include <memory>
#include <mutex>
#include <thread>

#include <iostream>

using namespace std::chrono_literals;

////////////////////////////////////////////////////////////////
// Main command execution interface for scheduling commands for
// execution and waiting for commands to complete via KDS.
@@ -410,56 +416,114 @@ class kds_device : public hw_queue_impl
{
xrt_core::device* m_device;
std::mutex m_exec_wait_mutex;
uint64_t m_exec_wait_call_count = 0;
std::condition_variable m_work;
uint64_t m_exec_wait_call_count {0};
uint32_t m_exec_wait_active {0};

// Thread safe shim level exec wait call. This function allows
// multiple threads to call exec_wait through same device handle.
//
// In multi-threaded applications it is possible that a call to shim
// level exec_wait by one thread covers completion for other
// threads. Without careful synchronization, a thread that calls
// device->exec_wait could end up being stuck either forever or
// device::exec_wait could end up being stuck either forever or
// until some other unrelated command completes. This function
// prevents that scenario from happening.
//
// Thread local storage keeps a call count that syncs with the
// number of times device->exec_wait has been called. If thread
// number of times device::exec_wait has been called. If thread
// local call count is different from the global count, then this
// function resets the thread local call count and return without
// calling shim exec_wait.
// calling device::exec_wait.
//
// The specified timeout has effect only when underlying xclExecWait
// times out. The timeout can be masked if device is busy and many
// In order to reduce multi-threaded wait time, condition variable
// wait is used for subsequent threads calling this function while
// some other thread is busy running device->exec_wait. Condition
// variable wait and wake up is faster than letting multiple threads
// wait on a single mutex lock. This means that device::exec_wait
// is done by the first host thread that needs the call to
// exec_wait, while condition variable notification is used for
// additional threads.
//
// The specified timeout affects the waiting for device::exec_wait
// only. The timeout can be masked if device is busy and many
// commands complete within the specified timeout.
std::cv_status
exec_wait(size_t timeout_ms=0)
{
static thread_local uint64_t thread_exec_wait_call_count = 0;
std::lock_guard lk(m_exec_wait_mutex);

if (thread_exec_wait_call_count != m_exec_wait_call_count) {
// some other thread has called exec_wait and may have
// covered this thread's commands, synchronize thread
// local call count and return to caller.
thread_exec_wait_call_count = m_exec_wait_call_count;
return std::cv_status::no_timeout;

// Critical section to check if this thread needs to call
// device::exec_wait or should wait on some other thread
// completing the call.
{
std::unique_lock lk(m_exec_wait_mutex);
if (thread_exec_wait_call_count != m_exec_wait_call_count) {
// Some other thread has called exec_wait and may have
// covered this thread's commands, synchronize thread
// local call count and return to caller.
thread_exec_wait_call_count = m_exec_wait_call_count;
return std::cv_status::no_timeout;
}

if (m_exec_wait_active) {
// Some other thread is calling device::exec_wait, wait
// for it to complete its work and notify this thread
auto status = std::cv_status::no_timeout;
if (timeout_ms)
status = (m_work.wait_for(lk, timeout_ms * 1ms,
[this] {
return thread_exec_wait_call_count != m_exec_wait_call_count;
}))
? std::cv_status::no_timeout
: std::cv_status::timeout;
else
m_work.wait(lk,
[this] {
return thread_exec_wait_call_count != m_exec_wait_call_count;
});

// The other thread has completed its exec_wait call,
// sync with current global call count and return
thread_exec_wait_call_count = m_exec_wait_call_count;
return status;
}

// Critical section updates wait status to prevent other threads
// from calling exec_wait while this thread is calling it.
++m_exec_wait_active;
}

// Call device exec_wait without keeping the lock to allow other
// threads to proceed into a conditional wait. This scales better
// than one giant exclusive region. It is guaranteed that only
// this thread will be here because other threads are blocked by
// the active count that is only modified in the exclusive region.
// assert(m_exec_wait_active == 1);
auto status = std::cv_status::no_timeout;
if (timeout_ms) {
// device exec_wait is a system poll which returns
// 0 when specified timeout is exceeded without any
// file descriptors to read
if (m_device->exec_wait(static_cast<int>(timeout_ms)) == 0)
// nothing happened within specified time
status = std::cv_status::timeout;
}
else {
// wait forever for some command to complete
while (m_device->exec_wait(1000) == 0) {}
}

// synchronize this thread with total call count
thread_exec_wait_call_count = ++m_exec_wait_call_count;
// Acquire lock before updating shared state
{
std::lock_guard lk(m_exec_wait_mutex);
thread_exec_wait_call_count = ++m_exec_wait_call_count;
--m_exec_wait_active;
}

// Notify any waiting threads so they can check command status and
// possibly call exec_wait again.
m_work.notify_all();

return status;
}
@@ -478,7 +542,7 @@ class kds_device : public hw_queue_impl
std::cv_status
wait(const xrt_core::command* cmd, size_t timeout_ms) override
{
auto pkt = cmd->get_ert_packet();
volatile auto pkt = cmd->get_ert_packet();
while (pkt->state < ERT_CMD_STATE_COMPLETED) {
// return immediately on timeout
if (exec_wait(timeout_ms) == std::cv_status::timeout)
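
For context, a hedged sketch of the kind of multi-threaded
[run, wait] loop the commit message benchmarks.  The xclbin path,
kernel name, thread count, and iteration count are placeholders, and
kernel arguments are elided (a real run would set them before
start()):

// Placeholder n-thread [run, wait] benchmark shape; args elided.
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"

static void
run_wait_loop(xrt::kernel& krnl, size_t iterations)
{
  xrt::run run{krnl};  // one run object reused by this thread
  for (size_t i = 0; i < iterations; ++i) {
    run.start();       // enqueue the command
    run.wait();        // ends up in the exec_wait path shown above
  }
}

int
main()
{
  xrt::device device{0};                           // device index 0
  auto uuid = device.load_xclbin("kernel.xclbin"); // placeholder path
  xrt::kernel krnl{device, uuid, "kernel_name"};   // placeholder name

  std::vector<std::thread> threads;
  for (int i = 0; i < 8; ++i)
    threads.emplace_back(run_wait_loop, std::ref(krnl), 10000);
  for (auto& t : threads)
    t.join();
}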
