Documentation update (LLNL#372)
* Add collective-output-channel example code

* Fix OpenMP metrics

* Add OpenMP profiling documentation

* Documentation updates; add CUDA profiling howto [skip ci]
daboehme authored Jun 25, 2021
1 parent c7256ef commit 903c2ba
Showing 11 changed files with 478 additions and 13 deletions.
184 changes: 184 additions & 0 deletions doc/sphinx/CUDA.rst
@@ -0,0 +1,184 @@
CUDA profiling
=======================================

Caliper can profile CUDA API functions and CUDA device-side activities like
kernel executions and memory copies. This requires Caliper to be built with
CUPTI support (`-DWITH_CUPTI=On`).
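
For example, a CMake configure line enabling CUPTI support might look like
this (a sketch; add other options as needed for your system)::

$ cmake -DWITH_CUPTI=On ..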

Profiling CUDA host-side API functions
---------------------------------------

The `profile.cuda` option is available for many ConfigManager configs like
`runtime-report`, `hatchet-region-profile`, and `spot`. It intercepts host-side
CUDA API functions like ``cudaMemcpy`` and ``cudaDeviceSynchronize``.
The example below shows `runtime-report` output with the `profile.cuda` option
enabled. We can see the time spent in CUDA functions like ``cudaMemcpy`` and
``cudaLaunchKernel``::

$ CALI_CONFIG=runtime-report,profile.cuda lrun -n 4 ./tea_leaf
Path Min time/rank Max time/rank Avg time/rank Time %
timestep_loop 0.000175 0.000791 0.000345 0.002076
summary 0.000129 0.000153 0.000140 0.000846
cudaMemcpy 0.000414 0.000422 0.000418 0.002516
cudaLaunchKernel 0.000247 0.000273 0.000256 0.001542
total_solve 0.000105 0.000689 0.000252 0.001516
reset 0.000160 0.000179 0.000169 0.001021
internal_halo_update 0.000027 0.000029 0.000028 0.000167
halo_update 0.000028 0.000031 0.000029 0.000176
halo_exchange 0.000496 0.000812 0.000607 0.003654
cudaMemcpy 0.001766 0.001792 0.001779 0.010713
cudaLaunchKernel 0.000418 0.000450 0.000430 0.002588
cudaLaunchKernel 0.000097 0.000126 0.000106 0.000640
...
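
The `profile.cuda` option can also be set programmatically through the
ConfigManager API. Below is a minimal sketch, assuming Caliper is installed;
the "work" region name is made up:

.. code-block:: c++

    #include <caliper/cali.h>
    #include <caliper/cali-manager.h>

    #include <iostream>

    int main()
    {
        cali::ConfigManager mgr;
        mgr.add("runtime-report,profile.cuda");

        if (mgr.error())
            std::cerr << "Caliper config error: " << mgr.error_msg() << std::endl;

        mgr.start(); // begin measurement

        CALI_MARK_BEGIN("work");
        // ... host-side CUDA API calls are intercepted here ...
        CALI_MARK_END("work");

        mgr.flush(); // write the runtime-report output
    }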

Profiling CUDA activities
---------------------------------------

The `cuda-activity-report` and `cuda-activity-profile` configs record the time
spent in CUDA activities (e.g. kernel executions or memory copies) on the CUDA
device. The GPU times are mapped to the Caliper regions that launched those
GPU activities. Here is example output for `cuda-activity-report`::

$ CALI_CONFIG=cuda-activity-report lrun -n 4 ./tea_leaf
Path Avg Host Time Max Host Time Avg GPU Time Max GPU Time GPU %
timestep_loop 17.052183 17.053008 12.020218 12.037409 70.490786
total_solve 17.048899 17.049464 12.019799 12.036989 70.501909
solve 17.026991 17.027000 12.009868 12.027047 70.534294
dot_product 0.004557 0.006057
cudaMalloc 0.000050 0.000055
internal_halo_update 0.051974 0.052401
halo_update 0.110023 0.111015
halo_exchange 15.154157 15.184635 0.621857 0.625657 4.103542
cudaMemcpy 12.398854 12.422384 0.234947 0.235587 1.894913
cudaLaunchKernel 1.426069 1.440035 0.386910 0.390767 27.131206
cudaMemcpy 0.497959 0.503879 0.003377 0.003433 0.678115
cudaLaunchKernel 0.772027 0.793401 11.384634 11.402917 1474.641141

For each Caliper region, we now see the time spent on the CPU ("Avg/Max Host
Time") and the aggregate time on the GPU for activities launched from
this region ("Avg/Max GPU Time"). Note how the ``cudaLaunchKernel`` call in
the last row of the output has ~0.78 seconds of CPU time associated with it,
but ~11.38 seconds of GPU time - these 11.38 seconds reflect the GPU
activities launched asynchronously from this region. Most CPU time in the
example output is actually spent in ``cudaMemcpy`` under ``halo_exchange``
- we can deduce that much of it is synchronization time spent waiting for
the GPU kernels to complete.
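
This attribution pattern is typical of asynchronous CUDA code. The
hypothetical sketch below (not taken from tea_leaf) shows why kernel time
lands under the launch call while the waiting appears in a synchronizing
call:

.. code-block:: c++

    #include <cuda_runtime.h>

    __global__ void scale(double* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= 2.0;
    }

    void run(double* host, double* dev, int n)
    {
        // The launch returns immediately; the kernel's execution time is
        // attributed as GPU time to the region that launched it.
        scale<<<(n + 255) / 256, 256>>>(dev, n);

        // The blocking copy cannot start until the queued kernel finishes,
        // so the wait shows up as CPU time under cudaMemcpy.
        cudaMemcpy(host, dev, n * sizeof(double), cudaMemcpyDeviceToHost);
    }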

The "GPU %" metric shows the fraction of (wall-clock) CPU time in which there
are GPU activities, and represents GPU utilization inside a Caliper region.
The percentage is based on the total inclusive CPU time and GPU activities for
the region and it sub-regions. The "GPU %" for the top-level region
(``timestep_loop``) represents the GPU utilization of the entire program.

The definitions of the reported metrics are as follows:

Avg Host Time
Inclusive time (seconds) in the Caliper region on the Host (CPU).
Typically the wall-clock time. Average value across MPI ranks.

Max Host Time
Inclusive time (seconds) in the Caliper region on the Host (CPU).
Typically the wall-clock time. Maximum value among all MPI ranks.

Avg GPU Time
Inclusive total time (seconds) of activities executing on the GPU
launched from the Caliper region. Average across MPI ranks.

Max GPU Time
Inclusive total time (seconds) of activities executing on the GPU
launched from the Caliper region. Maximum value among all MPI ranks.

GPU %
Fraction of total inclusive GPU time vs. CPU time. Typically
represents the GPU utilization in the Caliper region.

Show GPU kernels
---------------------------------------

Use the `show_kernels` option in `cuda-activity-report` to distinguish
individual CUDA kernels::

$ CALI_CONFIG=cuda-activity-report,show_kernels lrun -n 4 ./tea_leaf
Path Kernel Avg Host Time Max Host Time Avg GPU Time Max GPU Time GPU %
timestep_loop
|- 17.068956 17.069917 0.239392 0.240725 1.402501
|- device_unpack_top_buffe~~le*, double*, int, int) 0.091051 0.092734
|- device_tea_leaf_ppcg_so~~ const*, double const*) 5.409844 5.419096
|- device_tea_leaf_ppcg_so~~t*, double const*, int) 5.316101 5.320777
|- device_pack_right_buffe~~le*, double*, int, int) 0.112455 0.113198
|- device_pack_top_buffer(~~le*, double*, int, int) 0.092634 0.092820
(...)
|- device_pack_bottom_buff~~le*, double*, int, int) 0.098929 0.099095
summary
|- 0.000881 0.000964 0.000010 0.000011 1.179024
|- device_field_summary_ke~~ble*, double*, double*) 0.000325 0.000326
|- void reduction<double, ~~N_TYPE)0>(int, double*) 0.000083 0.000084
cudaMemcpy 0.000437 0.000457 0.000010 0.000011 2.376874
cudaLaunchKernel
|- 0.000324 0.000392
|- device_field_summary_ke~~ble*, double*, double*) 0.000325 0.000326
|- void reduction<double, ~~N_TYPE)0>(int, double*) 0.000083 0.000084

We now see the GPU time in each kernel. The display is "inclusive": for each
Caliper region, we see the kernels launched from this region as well as from
all regions below it. Under the top-level region (``timestep_loop``), we see
the total time spent in each CUDA kernel in the program.

Profiling memory copies
---------------------------------------

The `cuda.memcpy` option shows the amount of data copied between CPU and GPU
memory::

$ CALI_CONFIG=cuda-activity-report,cuda.memcpy lrun -n 4 ./tea_leaf
Path Avg Host Time Max Host Time Avg GPU Time Max GPU Time GPU % Copy CPU->GPU (avg) Copy CPU->GPU (max) Copy GPU->CPU (avg) Copy GPU->CPU (max)
timestep_loop 17.098644 17.100283 12.020492 12.040316 70.300846
total_solve 17.094882 17.095510 12.020072 12.039896 70.313864
solve 17.072250 17.072253 12.010138 12.029962 70.348891
dot_product 0.001489 0.001684
cudaMalloc 0.000050 0.000054
internal_halo_update 0.052739 0.053484
halo_update 0.110009 0.111008
halo_exchange 15.194774 15.220273 0.622893 0.629519 4.099388
cudaMemcpy 12.405958 12.430534 0.234917 0.235315 1.893578 902.517616 902.517616 902.469568 902.469568
cudaLaunchKernel 1.431387 1.463419 0.387976 0.394445 27.104925
cudaMemcpy 0.493047 0.494699 0.003369 0.003393 0.683394 0.040320 0.040320 0.019232 0.019232

In the example, there were about 902 Megabytes copied from CPU to GPU memory
and back. There are four new metrics:

Copy CPU->GPU (avg)
Data copied (Megabytes) from CPU to GPU memory. Average across MPI ranks.

Copy CPU->GPU (max)
Data copied (Megabytes) from CPU to GPU memory. Maximum among MPI ranks.

Copy GPU->CPU (avg)
Data copied (Megabytes) from GPU to CPU memory. Average across MPI ranks.

Copy GPU->CPU (max)
Data copied (Megabytes) from GPU to CPU memory. Maximum among MPI ranks.

Profiles in JSON or CALI format
---------------------------------------

The `cuda-activity-profile` config records data similar to
`cuda-activity-report` but writes it in machine-readable JSON or Caliper's
".cali" format for processing with `cali-query`. By default, it writes the
`json-split` format that can be read by Hatchet::

$ CALI_CONFIG=cuda-activity-profile lrun -n 4 ./tea_leaf
$ ls *.json
cuda_profile.json

Other formats can be selected with the `output.format` option. Possible
values:

cali
The Caliper profile/trace format for processing with `cali-query`.

hatchet
JSON format readable by Hatchet. See :ref:`json-split-format`.

json
Easy-to-parse list-of-dicts style JSON. See :ref:`json-format`.
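
For example, to write a ``.cali`` file for later processing with
`cali-query`, select the `cali` format (a sketch following the examples
above)::

$ CALI_CONFIG=cuda-activity-profile,output.format=cali lrun -n 4 ./tea_leaf
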
3 changes: 3 additions & 0 deletions doc/sphinx/CaliperBasics.rst
@@ -134,6 +134,9 @@ with `Hatchet <https://github.com/LLNL/hatchet>`_. See
:ref:`more-on-configurations` below to learn more about different
configurations and their options.


.. _notes_on_threading:

Notes on multi-threading
................................

161 changes: 161 additions & 0 deletions doc/sphinx/OpenMP.rst
@@ -0,0 +1,161 @@
OpenMP profiling
================================

Caliper can profile OpenMP constructs with the help of the OpenMP tools
interface (OMPT). This requires a compiler with OMPT support, e.g. clang 9+.
Build Caliper with ``-DWITH_OMPT=On`` to enable it.
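
For example, a CMake configure line with OMPT support might look like this
(a sketch, assuming a Clang toolchain)::

$ cmake -DCMAKE_CXX_COMPILER=clang++ -DWITH_OMPT=On ..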

When OMPT support is enabled, Caliper provides the `openmp-report` built-in
config, as well as the `openmp.times` and `openmp.threads` options for configs
like `runtime-report` and `hatchet-region-profile`. With manual configurations,
you can use the :ref:`ompt-service` service.

OpenMP profiling with openmp-report
------------------------------------

The `openmp-report` config measures and prints the time spent in OpenMP
constructs on each thread::

$ CALI_CONFIG=openmp-report ./caliper-openmp-example
Path #Threads Time (thread) Time (total) Work % Barrier % Time (work) Time (barrier)
main 0.005122 0.027660 85.969388 14.030612
work 4 0.005110 0.027572 85.969388 14.030612 0.011121 0.001815

This shows example output for a program like the one in section
:ref:`omp-instrumentation`. For Caliper regions with active OpenMP parallel
regions inside them (like "work" in the example), the report prints the
total CPU time across all threads spent doing OpenMP work ("Time (work)")
and stuck in OpenMP barriers ("Time (barrier)"). It also computes "Work %"
and "Barrier %" metrics, which indicate the relative amount of time spent in
work and barrier regions, respectively: in the example, 0.011121 / (0.011121
+ 0.001815) ≈ 85.97% of the OpenMP time is work. These two metrics are
*inclusive*, so the "main" region shows the overall OpenMP efficiency for
the entire program. The definitions of the metrics are as follows:

Path
The Caliper region hierarchy.

#Threads
Maximum number of OpenMP threads in the region.

Time (thread)
Time in seconds spent on the main thread (i.e., wall-clock time).

Time (total)
Sum of CPU time in seconds across all threads.

Work %
Percent of CPU time spent in OpenMP workshare regions vs. overall time in
OpenMP. This is an inclusive metric (e.g., aggregated over all child
regions).

Barrier %
Percent of CPU time spent in barriers vs. overall time in OpenMP. This is
also an inclusive metric.

Time (work)
CPU time spent in OpenMP workshare regions (e.g., ``#pragma omp for`` loops).

Time (barrier)
CPU time spent in OpenMP barriers (both implicit and explicit barriers).

Profiling by thread
.................................

With the "show_threads" option, we can see the times spent on each thread::

$ CALI_CONFIG=openmp-report,show_threads=1,show_regions=true ./caliper-openmp-example
Path #Threads Time (thread) Time (total) Thread Work % Barrier % Time (work) Time (barrier)
main
|- 0.000197 0.014312
|- 0.002747 3 100.000000
|- 0.003086 1 100.000000
|- 0.002856 2 100.000000
|- 0.005075 0.005075 0 62.539218 37.460782
work
|- 0.000180 0.014228
|- 4 0.002747 3 100.000000 0.002688
|- 4 0.003086 1 100.000000 0.003018
|- 4 0.002856 2 100.000000 0.002800
|- 4 0.005075 0.005075 0 62.539218 37.460782 0.002990 0.001791

We now see the additional "Thread" column with the OpenMP thread ID, and
"Time (total)" shows the CPU time in each OpenMP thread per Caliper region.
The rows without a "Thread" entry refer to time spent outside OpenMP parallel
regions.

Thread summary view
.................................

With "show_regions=false", we can disable the grouping by Caliper region and
just show a performance summary across OpenMP threads::

$ CALI_CONFIG=openmp-report,show_threads=true,show_regions=false
#Threads Time (thread) Time (total) Thread Work % Barrier % Time (work) Time (barrier)
0.000175 0.016727
4 0.002760 3 100.000000 0.002722
4 0.002958 1 100.000000 0.002901
4 0.002768 2 100.000000 0.002699
4 0.004913 0.004913 0 62.682091 37.317909 0.002926 0.001740

OpenMP options for other configs
--------------------------------

Caliper also provides the OpenMP options for other profiling configs such as
`hatchet-region-profile`; a usage example follows the option list below.
The options are:

openmp.times
Measure and return CPU time spent in OpenMP workshare and barrier regions.

openmp.efficiency
Compute the "Work %" and "Barrier %" metrics as in the openmp-report config
shown above.

openmp.threads
Group by thread (i.e., record metrics for each OpenMP thread separately),
as with the "show_threads" option for `openmp-report` shown above.
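
For example, a sketch combining these options with `hatchet-region-profile`
(program name as in the earlier examples)::

$ CALI_CONFIG=hatchet-region-profile,openmp.times,openmp.efficiency,openmp.threads ./caliper-openmp-example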

.. _omp-instrumentation:

Instrumentation
--------------------------------

When instrumenting the code, make sure to use "process"-scope annotations
to mark code regions outside of OpenMP parallel regions, *or* put thread-scope
annotations onto every OpenMP thread. Mixing both approaches is possible but
not recommended, since it produces separate region hierarchies for the process
and thread scopes. See :ref:`notes_on_threading` for more information.

Use the ``CALI_CALIPER_ATTRIBUTE_DEFAULT_SCOPE`` config flag to define
whether Caliper regions should use process or thread scope. They use thread
scope by default. The following example sets the attribute default scope to
"process" so that the "main" and "work" Caliper regions are visible on all
OpenMP threads inside the parallel region:

.. code-block:: c++

    #include <caliper/cali.h>

    int main()
    {
        cali_config_set("CALI_CALIPER_ATTRIBUTE_DEFAULT_SCOPE", "process");

        CALI_MARK_BEGIN("main");
        CALI_MARK_BEGIN("work");

        #pragma omp parallel
        {
            // ...
        }

        CALI_MARK_END("work");
        CALI_MARK_END("main");
    }
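
A hypothetical build-and-run sketch for this example (compiler flags,
library names, and paths depend on your installation)::

$ clang++ -fopenmp example.cpp -o caliper-openmp-example -lcaliper
$ CALI_CONFIG=openmp-report ./caliper-openmp-example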

Enabling OMPT
-------------------------------

Caliper enables the OpenMP tools interface automatically when the `ompt`
service is active. In some cases this can fail, typically when the OpenMP
runtime is initialized before Caliper. In that case, set the
:envvar:`CALI_USE_OMPT` environment variable to "1" or "true" to enable
OpenMP support manually.
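
For example::

$ CALI_USE_OMPT=1 CALI_CONFIG=openmp-report ./caliper-openmp-example
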
4 changes: 4 additions & 0 deletions doc/sphinx/OutputFormats.rst
@@ -137,6 +137,8 @@ Output::
IntegrateStressForElems 100 358512
TimeIncrement 100 1280

.. _json-format:

Json
--------------------------------

@@ -171,6 +173,8 @@ Output::
{"loop":"lulesh.cycle","function":"main/LagrangeLeapFrog/LagrangeNodal/CalcForceForNodes","count":100,"time.inclusive.duration":1466051},
{"loop":"lulesh.cycle","function":"main/LagrangeLeapFrog/LagrangeNodal/CalcForceForNodes/CalcVolumeForceForElems","count":100,"time.inclusive.duration":1456613},

.. _json-split-format:

Json-split
--------------------------------

6 changes: 3 additions & 3 deletions doc/sphinx/configuration.rst
@@ -110,9 +110,9 @@ These variables can only be set as environment variables.
Set to "1" or "true" to activate the OpenMP tools interface in the OpenMP
runtime. This is required for the ompt service. It is only necessary to set
this if the OpenMP runtime is first initialized before the ompt service is
initialized, otherwise ompt will activate the OpenMP tools interface
automatically. When set to "0" or "false" Caliper will not use the
OpenMP tools interface. See :ref:`ompt <ompt-service>`
initialized, otherwise the `ompt` service will activate the OpenMP tools
interface automatically. When set to "0" or "false" Caliper will not use
the OpenMP tools interface. See :ref:`ompt <ompt-service>`

Configuration variables
........................................