Documentation for scan kernel template (uxlfoundation#1559)
Co-authored-by: Dmitriy Sobolev <[email protected]>
1 parent e8563f6 · commit 9e73723
Showing 2 changed files with 210 additions and 0 deletions.
documentation/library_guide/kernel_templates/single_pass_scan.rst (208 additions, 0 deletions)
Inclusive Scan
##############

--------------------------------
inclusive_scan Function Template
--------------------------------

The ``inclusive_scan`` function computes the inclusive prefix sum using a given binary operation.
The function implements a single-pass algorithm, where each input element is read exactly once from
global memory and each output element is written exactly once to global memory.

Because it relies on cross-work-group communication, the algorithm is designed for devices that provide
at least parallel forward progress guarantees between work-groups. Additionally, it requires support for
device USM (Unified Shared Memory). It has been verified to be compatible with `Intel® Data Center GPU Max Series
<https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series/products.html>`_.

A synopsis of the ``inclusive_scan`` function is provided below:

.. code:: cpp

   // defined in <oneapi/dpl/experimental/kernel_templates>

   namespace oneapi::dpl::experimental::kt::gpu {

   template <typename InIterator, typename OutIterator, typename BinaryOp, typename KernelParam>
   sycl::event
   inclusive_scan (sycl::queue q, InIterator in_begin, InIterator in_end, OutIterator out_begin,
                   BinaryOp binary_op, KernelParam param); // (1)

   template <typename InRng, typename OutRng, typename BinaryOp, typename KernelParam>
   sycl::event
   inclusive_scan (sycl::queue q, InRng in_rng, OutRng out_rng, BinaryOp binary_op,
                   KernelParam param); // (2)

   }

Parameters
----------

+------------------------------------------------+---------------------------------------------------------------------+
| Name                                           | Description                                                         |
+================================================+=====================================================================+
| ``q``                                          | The SYCL* queue where kernels are submitted.                        |
+------------------------------------------------+---------------------------------------------------------------------+
|                                                |                                                                     |
|                                                | The sequences to apply the algorithm to.                            |
| - ``in_begin``, ``in_end``, ``out_begin`` (1), | Supported sequence types:                                           |
| - ``in_rng``, ``out_rng`` (2).                 |                                                                     |
|                                                | - :ref:`USM pointers <use-usm>` (1),                                |
|                                                | - :ref:`oneapi::dpl::begin and oneapi::dpl::end                     |
|                                                |   <use-buffer-wrappers>` (1),                                       |
|                                                | - ``sycl::buffer`` (2),                                             |
|                                                | - :ref:`oneapi::dpl::experimental::ranges::views::all               |
|                                                |   <viewable-ranges>` (2),                                           |
|                                                | - :ref:`oneapi::dpl::experimental::ranges::views::subrange          |
|                                                |   <viewable-ranges>` (2).                                           |
|                                                |                                                                     |
+------------------------------------------------+---------------------------------------------------------------------+
| ``binary_op``                                  | A function object that is applied to the elements of the input.     |
|                                                |                                                                     |
+------------------------------------------------+---------------------------------------------------------------------+
| ``param``                                      | A :doc:`kernel_param <kernel_configuration>` object.                |
|                                                |                                                                     |
+------------------------------------------------+---------------------------------------------------------------------+

**Type Requirements**:

- The element type of the sequence to scan must be a 32-bit or 64-bit C++ integral or floating-point type.
- The result is non-deterministic if the binary operator is non-associative (such as floating-point addition)
  or non-commutative.

.. note::

   Current limitations:

   - The function will internally block until the issued kernels have completed execution.
     Although it is intended to be an asynchronous call in the future, the algorithm is currently synchronous.
   - The SYCL device associated with the provided queue must support 64-bit atomic operations if the element type is 64 bits wide.
   - There must be a known identity value for the provided combination of the element type and the binary operation.
     That is, ``sycl::has_known_identity_v`` must evaluate to true (see the sketch below). Such operators are listed in the
     `SYCL 2020 specification <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.identities>`_.
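
The known-identity requirement can be checked at compile time with the standard SYCL 2020 trait
``sycl::has_known_identity_v``. The following sketch is illustrative and is not part of the kernel
template itself:

.. code:: cpp

   // sketch: verifying the known-identity requirement at compile time
   #include <cstdint>
   #include <functional>
   #include <sycl/sycl.hpp>

   // std::plus over std::uint32_t has a known identity (zero), so this combination satisfies the requirement;
   // a custom binary operator without a sycl::known_identity specialization would not
   static_assert(sycl::has_known_identity_v<std::plus<std::uint32_t>, std::uint32_t>);

   int main() { return 0; }
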
Return Value
------------

A ``sycl::event`` object representing the status of the algorithm execution.

--------------
Usage Examples
--------------

inclusive_scan Example
----------------------

.. code:: cpp

   // possible build and run commands:
   //    icpx -fsycl inclusive_scan.cpp -o inclusive_scan -I /path/to/oneDPL/include && ./inclusive_scan

   #include <cstdint>
   #include <iostream>
   #include <sycl/sycl.hpp>
   #include <oneapi/dpl/experimental/kernel_templates>

   namespace kt = oneapi::dpl::experimental::kt;

   int main()
   {
      std::size_t n = 6;
      sycl::queue q{sycl::gpu_selector_v};
      std::uint32_t* arr = sycl::malloc_shared<std::uint32_t>(n, q);
      std::uint32_t* out = sycl::malloc_shared<std::uint32_t>(n, q);

      // initialize
      arr[0] = 1, arr[1] = 2, arr[2] = 1, arr[3] = 3, arr[4] = 1, arr[5] = 2;

      // scan
      auto e = kt::gpu::inclusive_scan(q, arr, arr + n, out, std::plus<std::uint32_t>{}, kt::kernel_param<256, 8>{});
      e.wait();

      for (std::size_t i = 0; i < n; ++i)
         std::cout << out[i] << ' ';
      std::cout << '\n';

      sycl::free(arr, q);
      sycl::free(out, q);
      return 0;
   }

**Output:**

.. code:: none

   1 3 4 7 8 10
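
The range-based overload (2) is used similarly. The following sketch is not part of the original
example set; it assumes the same build commands as above and passes ``sycl::buffer`` objects,
which the parameter table lists as a supported sequence type for overload (2):

.. code:: cpp

   #include <cstdint>
   #include <iostream>
   #include <vector>
   #include <sycl/sycl.hpp>
   #include <oneapi/dpl/experimental/kernel_templates>

   namespace kt = oneapi::dpl::experimental::kt;

   int main()
   {
      std::vector<std::uint32_t> in{1, 2, 1, 3, 1, 2};
      std::vector<std::uint32_t> out(in.size());

      sycl::queue q{sycl::gpu_selector_v};
      {
         // the buffers own the host data for the duration of this scope
         sycl::buffer<std::uint32_t> in_buf(in.data(), sycl::range<1>(in.size()));
         sycl::buffer<std::uint32_t> out_buf(out.data(), sycl::range<1>(out.size()));

         // scan with the range-based overload (2)
         auto e = kt::gpu::inclusive_scan(q, in_buf, out_buf, std::plus<std::uint32_t>{}, kt::kernel_param<256, 8>{});
         e.wait();
      } // the buffer destructors copy the results back to the host vectors

      for (std::size_t i = 0; i < out.size(); ++i)
         std::cout << out[i] << ' '; // expected: 1 3 4 7 8 10
      std::cout << '\n';
      return 0;
   }
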
.. _scan-memory-requirements:

-------------------
Memory Requirements
-------------------

The algorithm uses global and local device memory (see `SYCL 2020 Specification
<https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_sycl_device_memory_model>`__)
for intermediate data storage. For the algorithm to operate correctly, there must be enough memory on the device.
The function throws a ``std::bad_alloc`` exception if there is not enough global device memory;
the behavior is undefined if there is not enough local memory. The amount of memory that is required
depends on the input data and the configuration parameters, as described below.

Global Memory Requirements
--------------------------

Global memory is used for copying the input sequence and storing internal data such as status flags.
The used amount depends on many parameters; below is an approximation in bytes:

2 * V * N\ :sub:`flags` + 4 * N\ :sub:`flags`

where V is the number of bytes needed to store the input value type.

The value of N\ :sub:`flags` represents the number of work-groups and depends on ``param.data_per_workitem``
and ``param.workgroup_size``. It can be approximated by dividing the number of input elements N
by the product of ``param.data_per_workitem`` and ``param.workgroup_size``.
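
As a purely illustrative calculation (the element count and kernel parameters below are assumptions,
not values taken from this documentation), the approximation can be evaluated as follows:

.. code:: cpp

   // hypothetical estimate of the global memory approximation above
   #include <cstddef>
   #include <cstdint>
   #include <iostream>

   int main()
   {
      std::size_t n = std::size_t{1} << 20;   // assumed number of input elements
      std::size_t v = sizeof(std::uint32_t);  // V: bytes needed to store one input value
      std::size_t elems_per_wg = 256 * 8;     // param.data_per_workitem * param.workgroup_size

      // N_flags: approximate number of work-groups
      std::size_t n_flags = (n + elems_per_wg - 1) / elems_per_wg; // = 512

      // 2 * V * N_flags + 4 * N_flags
      std::cout << 2 * v * n_flags + 4 * n_flags << " bytes\n";    // prints 6144
   }
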
.. note::

   If the number of input elements can be efficiently processed by a single work-group,
   the kernel template is executed by a single work-group and does not use any global memory.

Local Memory Requirements
-------------------------

Local memory is used for storing elements of the input that are to be scanned by a single work-group.
The used amount is denoted as N\ :sub:`elems_per_workgroup`, which equals
``sizeof(value_type) * param.data_per_workitem * param.workgroup_size``, where ``value_type`` is the
element type of the input sequence.

Some amount of local memory is also used by the calls to SYCL's group reduction and group scan.
The amount of memory used by these calls is implementation dependent.
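
For illustration only, using the ``kernel_param<256, 8>`` configuration from the usage example above
(an assumption rather than a recommendation) and 32-bit elements, the amount works out as follows:

.. code:: cpp

   #include <cstdint>

   // hypothetical estimate of the local memory used per work-group for 32-bit elements:
   // data_per_workitem * workgroup_size = 256 * 8 = 2048 elements per work-group, so
   // N_elems_per_workgroup = sizeof(std::uint32_t) * 2048 = 8192 bytes,
   // plus an implementation-dependent amount for the SYCL group reduction and scan calls
   static_assert(sizeof(std::uint32_t) * 256 * 8 == 8192);

   int main() { return 0; }
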
-----------------------------------------
Recommended Settings for Best Performance
-----------------------------------------

The general advice is to choose kernel parameters based on performance measurements and profiling information.
The initial configuration may be selected according to these high-level guidelines:

- When the number of elements is small enough to fit within a single work-group, the algorithm ignores the kernel
  parameters and instead dispatches to a single work-group version, which is generally more efficient for small inputs.

- Generally, utilizing all available compute cores is key for better performance. To provide sufficient work
  for all X\ :sup:`e`-cores [#fnote1]_ on a GPU, use ``param.data_per_workitem * param.workgroup_size ≈ N / xe_core_count``
  (see the sizing sketch after this list).

- On devices with multiple tiles, it may prove beneficial to experiment with different tile hierarchies, as described
  in `Options for using a GPU Tile Hierarchy <https://www.intel.com/content/www/us/en/developer/articles/technical/flattening-gpu-tile-hierarchy.html>`_.
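
The guideline on utilizing all compute cores can be turned into a quick sizing calculation. The element
count and core count below are illustrative assumptions, not recommendations from this documentation:

.. code:: cpp

   // hypothetical sizing sketch for param.data_per_workitem * param.workgroup_size ≈ N / xe_core_count
   #include <cstddef>
   #include <iostream>

   int main()
   {
      std::size_t n = std::size_t{1} << 24; // assumed number of input elements
      std::size_t xe_core_count = 64;       // assumed Xe-core count, taken from the device specification

      // target product of data_per_workitem and workgroup_size
      std::size_t target = n / xe_core_count; // = 262144
      std::cout << "aim for data_per_workitem * workgroup_size of about " << target << '\n';
      // for example, a work-group size of 1024 with 256 elements per work-item gives exactly 262144,
      // i.e. roughly one work-group's worth of data per Xe-core
   }
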
.. warning::

   Avoid setting overly large values for ``param.data_per_workitem`` and ``param.workgroup_size``.
   Make sure that the :ref:`memory requirements <scan-memory-requirements>` are satisfied.

.. [#fnote1] The X\ :sup:`e`-core term is described in the `oneAPI GPU Optimization Guide
   <https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-0/intel-xe-gpu-architecture.html#XE-CORE>`_.
   Check the number of cores in the device specification, such as the `Intel® Data Center GPU Max specification
   <https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series/products.html>`_.