For both parts A and B, every task system implementation except `TaskSystemSerial` and `TaskSystemParallelSpawn` uses a thread pool to manage threads, which reduces the overhead of creating threads at execution time. All of the implementations also use dynamic assignment: each worker thread takes the first available task from a shared task queue, removes it from the queue so that no other worker repeats the same task, and begins executing it. Once execution is done, the worker reports the finished task on a separate queue back to the main thread, so the main thread knows when every task in the work queue has finished and can return to the caller.
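As a concrete illustration, here is a minimal sketch of this dynamic-assignment worker loop. The names (`task_queue`, `done_queue`, `workerLoop`) and the single-mutex locking are assumptions for the sketch, not the actual implementation:

```cpp
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Shared state (names hypothetical): a queue of (task id, work) pairs and a
// queue of finished ids that the main thread drains to detect completion.
std::queue<std::pair<int, std::function<void()>>> task_queue;
std::queue<int> done_queue;
std::mutex queue_mutex;
bool done_work = false;  // flipped during teardown (see the end of this section)

void workerLoop() {
    while (true) {
        std::pair<int, std::function<void()>> item;
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            if (done_work) return;             // pool is shutting down
            if (task_queue.empty()) continue;  // keep spinning until work arrives
            item = std::move(task_queue.front());
            task_queue.pop();                  // remove so no other worker repeats it
        }
        item.second();  // execute outside the lock, concurrently with other workers
        std::lock_guard<std::mutex> lock(queue_mutex);
        done_queue.push(item.first);           // report completion to the main thread
    }
}
```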
In part A, the bookkeeping is straightforward because there is only one `IRunnable*` task launch to execute at a given time. Instead of a full work queue, every thread shares a variable `current_task` that tracks the next available task index to execute, and another variable `task_done` that counts how many tasks have finished. Two separate mutexes guard these shared variables: one for `current_task` and the other for `task_done`. A thread locks a mutex only while incrementing its variable and unlocks it as soon as the increment is done, so the actual task execution runs concurrently across threads.
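A sketch of the part A worker loop under this two-counter design is below. The `IRunnable::runTask(task_id, num_total_tasks)` signature follows the assignment's interface; the loop shown handles a single launch, whereas the real pool keeps its workers alive across calls to `run()`:

```cpp
#include <mutex>

// The assignment's bulk task launch interface.
class IRunnable {
public:
    virtual ~IRunnable() = default;
    virtual void runTask(int task_id, int num_total_tasks) = 0;
};

IRunnable* runnable;    // the launch currently being executed, set by run()
int num_total_tasks;    // number of tasks in the launch
int current_task = 0;   // next task index to hand out
int task_done = 0;      // how many tasks have finished
std::mutex task_mutex;  // guards current_task
std::mutex done_mutex;  // guards task_done

void workerLoop() {
    while (true) {
        int my_task;
        {
            std::lock_guard<std::mutex> lock(task_mutex);
            if (current_task >= num_total_tasks) break;  // every task handed out
            my_task = current_task++;                    // claim the next task
        }
        runnable->runTask(my_task, num_total_tasks);     // runs unlocked, concurrently
        std::lock_guard<std::mutex> lock(done_mutex);
        ++task_done;  // run() returns once task_done == num_total_tasks
    }
}
```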
In part B, the bookkeeping gets more complicated because the task system must track dependencies between tasks. The execution steps are as follows, with a sketch of the core bookkeeping after the list:
1. The dependencies are tracked in a map called `tasks_dep`, with `TaskID` as the key and a `set<TaskID>` as the value listing the other `TaskID`s that the key depends on. This map is built during `runAsyncWithDeps()` as tasks are added to the system.
2. Upon execution of `sync()`, we identify tasks that are ready to execute by scanning the map and adding them to a separate queue of ready tasks (called `ready_tasks`). By the design of the map, a `TaskID` is ready when the `set<TaskID>` under its key is empty. Each ready `TaskID` key is also removed from the map.
3. Once tasks from `ready_tasks` finish executing, they are added to another queue (called `done_tasks`) that is shared with the main thread.
4. The main thread iterates through the `tasks_dep` entries and removes every `TaskID` that appears in `done_tasks`, which is conceptually the same as deleting a dependency on a task once we know that task has finished.
5. Repeat steps 2-4 until `tasks_dep` is empty and no more tasks are waiting in `done_tasks`.
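A minimal sketch of steps 2 and 4 under this design follows. The container names mirror the prose, but the locking around the shared queues and the hand-off of ready tasks to the worker threads are elided:

```cpp
#include <map>
#include <queue>
#include <set>

using TaskID = int;

std::map<TaskID, std::set<TaskID>> tasks_dep;  // task -> unfinished dependencies
std::queue<TaskID> ready_tasks;                // dependency-free tasks, fed to workers
std::queue<TaskID> done_tasks;                 // finished tasks, reported by workers

// Step 2: move every task whose dependency set is empty into ready_tasks.
void scheduleReadyTasks() {
    for (auto it = tasks_dep.begin(); it != tasks_dep.end(); ) {
        if (it->second.empty()) {
            ready_tasks.push(it->first);
            it = tasks_dep.erase(it);  // remove so it is not scheduled twice
        } else {
            ++it;
        }
    }
}

// Step 4: for each finished task, erase it from every remaining dependency set.
void retireFinishedTasks() {
    while (!done_tasks.empty()) {
        TaskID finished = done_tasks.front();
        done_tasks.pop();
        for (auto& entry : tasks_dep)
            entry.second.erase(finished);
    }
}
```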
At the end of `run()` (for part A) and `sync()` (for part B), the thread pool is shut down by flipping the flag `done_work`, which signals the worker threads to stop spinning in their loop waiting for more tasks. The threads are then joined, completing the teardown.
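A sketch of this teardown, assuming the workers check `done_work` under the same mutex that guards the shared task state:

```cpp
#include <mutex>
#include <thread>
#include <vector>

void shutdown(std::vector<std::thread>& workers, std::mutex& m, bool& done_work) {
    {
        std::lock_guard<std::mutex> lock(m);
        done_work = true;  // each worker observes the flag and exits its loop
    }
    for (auto& t : workers)
        t.join();          // wait for every worker before completing teardown
}
```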
The sequential task system performs best when there is a small number of tasks and each task completes very quickly, because the overhead of thread creation and synchronization outweighs the benefit of concurrent execution. For example, in the `super_super_light` test, where the tasks are relatively small compared to other tests like `ping_pong_*`, the serial implementation finishes in 9.051ms, close to the 8.805ms of the sleeping thread pool implementation and faster than the 34.069ms of the spawn-every-launch implementation.
The spawn-every-launch implementation performs as well as the more advanced implementations when each task takes longer to compute or when there is a large number of tasks: as each task gets more expensive or the tasks grow in number, the thread-creation overhead becomes small relative to the total execution time. For example, in `ping_pong_equal`, where there are ~500k tasks and each task loops for 32 iterations, the spawn-every-launch implementation takes 293.946ms compared to 271.48ms for the sleeping thread pool implementation. Conversely, spawn-every-launch performs worse when there are few tasks or each task is cheap to compute. This is visible in the `super_super_light` test, where it takes 37.286ms to complete compared to 9.322ms for the most advanced implementation.