diff --git a/documentation/library_guide/kernel_templates/single_pass_scan.rst b/documentation/library_guide/kernel_templates/single_pass_scan.rst new file mode 100644 index 00000000000..80ade73812e --- /dev/null +++ b/documentation/library_guide/kernel_templates/single_pass_scan.rst @@ -0,0 +1,208 @@ +Inclusive Scan +############## + +-------------------------------- +inclusive_scan Function Template +-------------------------------- + +The ``inclusive_scan`` function computes the inclusive prefix sum using a given binary operation. +The function implements a single-pass algorithm, where each input element is read exactly once from +global memory and each output element is written to exactly once in global memory. + +The algorithm is designed to be compatible with a variety of devices that provide at least parallel +forward progress guarantees between work-groups, due to cross-work-group communication. Additionally, it +requires support for device USM (Unified Shared Memory). It has been verified to be compatible +with `Intel® Data Center GPU Max Series +`_. + +A synopsis of the ``inclusive_scan`` function is provided below: + +.. code:: cpp + + // defined in + + namespace oneapi::dpl::experimental::kt::gpu { + + template + sycl::event + inclusive_scan (sycl::queue q, InIterator in_begin, InIterator in_end, OutIterator out_begin, + BinaryOp binary_op, KernelParam param); // (1) + + template + sycl::event + inclusive_scan (sycl::queue q, InRng in_rng, OutRng out_rng, BinaryOp binary_op, + KernelParam param) // (2) + + } + + +Parameters +---------- + ++------------------------------------------------+---------------------------------------------------------------------+ +| Name | Description | ++================================================+=====================================================================+ +| ``q`` | The SYCL* queue where kernels are submitted. | ++------------------------------------------------+---------------------------------------------------------------------+ +| | | +| | The sequences to apply the algorithm to. | +| - ``in_begin``, ``in_end``, ``out_begin`` (1), | Supported sequence types: | +| - ``in_rng``, ``out_rng`` (2). | | +| | - :ref:`USM pointers ` (1), | +| | - :ref:`oneapi::dpl::begin and oneapi::dpl::end | +| | ` (1), | +| | - ``sycl::buffer`` (2), | +| | - :ref:`oneapi::dpl::experimental::ranges::views::all | +| | ` (2), | +| | - :ref:`oneapi::dpl::experimental::ranges::views::subrange | +| | ` (2). | +| | | ++------------------------------------------------+---------------------------------------------------------------------+ +| ``binary_op`` | A function object that is applied to the elements of the input. | +| | | ++------------------------------------------------+---------------------------------------------------------------------+ +| ``param`` | A :doc:`kernel_param ` object. | +| | | ++------------------------------------------------+---------------------------------------------------------------------+ + + +**Type Requirements**: + +- The element type of sequence to scan must be a 32-bit or 64-bit bit C++ integral or floating-point type. +- The result is non-deterministic if the binary operator is non-associative (such as in floating-point addition) + or non-commutative. + + +.. note:: + + Current limitations: + + - The function will internally block until the issued kernels have completed execution. + Although intended in the future to be an asynchronous call, the algorithm is currently synchronous. + - The SYCL device associated with the provided queue must support 64-bit atomic operations if the element type is 64-bits. + - There must be a known identity value for the provided combination of the element type and the binary operation. That is, ``sycl::has_known_identity_v`` must evaluate to true. Such operators are listed in the `SYCL 2020 specification `_. + +Return Value +------------ + +A ``sycl::event`` object representing the status of the algorithm execution. + +-------------- +Usage Examples +-------------- + + +inclusive_scan Example +---------------------- + +.. code:: cpp + + // possible build and run commands: + // icpx -fsycl inclusive_scan.cpp -o inclusive_scan -I /path/to/oneDPL/include && ./inclusive_scan + + #include + #include + #include + + #include + + namespace kt = oneapi::dpl::experimental::kt; + + int main() + { + std::size_t n = 6; + sycl::queue q{sycl::gpu_selector_v}; + std::uint32_t* arr = sycl::malloc_shared(n, q); + std::uint32_t* out = sycl::malloc_shared(n, q); + + // initialize + arr[0] = 1, arr[1] = 2, arr[2] = 1, arr[3] = 3, arr[4] = 1, arr[5] = 2; + + // scan + auto e = kt::gpu::inclusive_scan(q, arr, arr + n, out, std::plus{}, kt::kernel_param<256, 8>{}); + e.wait(); + + // print + for(std::size_t i = 0; i < n; ++i) + std::cout << out[i] << ' '; + std::cout << '\n'; + + sycl::free(arr, q); + sycl::free(out, q); + return 0; + } + +**Output:** + +.. code:: none + + 1 3 4 7 8 10 + +.. _scan-memory-requirements: + +------------------- +Memory Requirements +------------------- + +The algorithm uses global and local device memory (see `SYCL 2020 Specification +`__) +for intermediate data storage. For the algorithm to operate correctly, there must be enough memory +on the device. It throws a ``std::bad_alloc`` exception if there is not enough global device memory. The behavior is undefined if there is not enough local memory. The amount of memory that is required +depends on input data and configuration parameters, as described below. + +Global Memory Requirements +-------------------------- + +Global memory is used for copying the input sequence and storing internal data such as status flags. +The used amount depends on many parameters; below is an approximation in bytes: + +2 * V * N \ :sub:`flags` + 4 * N \ :sub:`flags` + +where V is the number of bytes needed to store the input value type. + +The value of N\ :sub:`flags` represents the number of work-groups and depends on ``param.data_per_workitem`` and ``param.workgroup_size``. +It can be approximated by dividing the number of input elements N by the product of ``param.data_per_workitem`` and ``param.workgroup_size``. + +.. note:: + + If the number of input elements can be efficiently processed by a single work-group, + the kernel template is executed by a single work-group and does not use any global memory. + + +Local Memory Requirements +------------------------- + +Local memory is used for storing elements of the input that are to be scanned by a single work-group. +The used amount is denoted as N\ :sub:`elems_per_workgroup`, which equals to ``sizeof(key_type) * param.data_per_workitem * param.workgroup_size``. + +Some amount of local memory is also used by the calls to SYCL's group reduction and group scan. The amount of memory used particularly +for these calls is implementation dependent. + +----------------------------------------- +Recommended Settings for Best Performance +----------------------------------------- + +The general advice is to choose kernel parameters based on performance measurements and profiling information. +The initial configuration may be selected according to these high-level guidelines: + + +- When the number of elements is small enough to fit within single work-group, the algorithm will ignore kernel + parameters and instead dispatch to a single work-group version, where it is generally more efficient. + +- Generally, utilizing all available + compute cores is key for better performance. To allow sufficient work to satisfy all + X\ :sup:`e`-cores [#fnote1]_ on a GPU, use ``param.data_per_workitem * param.workgroup_size ≈ N / xe_core_count``. + +- On devices with multiple tiles, it may prove beneficial to experiment with different tile hierarchies as described + in `Options for using a GPU Tile Hierarchy `_. + + +.. warning:: + + Avoid setting too large ``param.data_per_workitem`` and ``param.workgroup_size`` values. + Make sure that :ref:`Memory requirements ` are satisfied. + +.. [#fnote1] The X\ :sup:`e`-core term is described in the `oneAPI GPU Optimization Guide + `_. + Check the number of cores in the device specification, such as `Intel® Data Center GPU Max specification + `_. diff --git a/documentation/library_guide/kernel_templates_main.rst b/documentation/library_guide/kernel_templates_main.rst index f50a28893ad..77affdc4165 100644 --- a/documentation/library_guide/kernel_templates_main.rst +++ b/documentation/library_guide/kernel_templates_main.rst @@ -15,6 +15,7 @@ The primary API namespace is ``oneapi::dpl::experimental::kt``, and nested names * :doc:`Kernel Configuration `. Generic structure for configuring a kernel template. * :doc:`ESIMD-based kernel templates `. Algorithms implemented with the Explicit SIMD SYCL extension. +* :doc:`Inclusive scan algorithm `. Inclusive scan kernel template algorithm using a single-pass approach. .. toctree:: :maxdepth: 2 @@ -24,3 +25,4 @@ The primary API namespace is ``oneapi::dpl::experimental::kt``, and nested names kernel_templates/kernel_configuration kernel_templates/esimd_main + kernel_templates/single_pass_scan