forked from LLNL/Caliper
* Add collective-output-channel example code * Fix OpenMP metrics * Add OpenMP profiling documentation * Documentation updates; add CUDA profiling howto [skip ci]
Showing 11 changed files with 478 additions and 13 deletions.
CUDA profiling
=======================================

Caliper can profile CUDA API functions and CUDA device-side activities like
kernel executions and memory copies. This requires Caliper to be built with
CUpti support (`-DWITH_CUPTI=On`).

Profiling CUDA host-side API functions
---------------------------------------

The `profile.cuda` option is available for many ConfigManager configs like
`runtime-report`, `hatchet-region-profile`, and `spot`. It intercepts host-side
CUDA API functions like ``cudaMemcpy`` and ``cudaDeviceSynchronize``.
The example below shows `runtime-report` output with the `profile.cuda` option
enabled. We can see the time spent in CUDA functions like ``cudaMemcpy`` and
``cudaLaunchKernel``::

    $ CALI_CONFIG=runtime-report,profile.cuda lrun -n 4 ./tea_leaf
    Path                 Min time/rank Max time/rank Avg time/rank Time %
    timestep_loop             0.000175      0.000791      0.000345 0.002076
    summary                   0.000129      0.000153      0.000140 0.000846
    cudaMemcpy                0.000414      0.000422      0.000418 0.002516
    cudaLaunchKernel          0.000247      0.000273      0.000256 0.001542
    total_solve               0.000105      0.000689      0.000252 0.001516
    reset                     0.000160      0.000179      0.000169 0.001021
    internal_halo_update      0.000027      0.000029      0.000028 0.000167
    halo_update               0.000028      0.000031      0.000029 0.000176
    halo_exchange             0.000496      0.000812      0.000607 0.003654
    cudaMemcpy                0.001766      0.001792      0.001779 0.010713
    cudaLaunchKernel          0.000418      0.000450      0.000430 0.002588
    cudaLaunchKernel          0.000097      0.000126      0.000106 0.000640
    ...

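The Min/Max/Avg time/rank columns are per-region reductions of inclusive time across the MPI ranks. A minimal Python sketch of that reduction, using hypothetical per-rank timings (not taken from the run above):

```python
# Hypothetical inclusive times (seconds) for one Caliper region on 4 MPI ranks.
rank_times = [0.000175, 0.000791, 0.000212, 0.000202]

min_time = min(rank_times)                    # "Min time/rank"
max_time = max(rank_times)                    # "Max time/rank"
avg_time = sum(rank_times) / len(rank_times)  # "Avg time/rank"

print(min_time, max_time, round(avg_time, 6))
```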
Profiling CUDA activities
---------------------------------------

The `cuda-activity-report` and `cuda-activity-profile` configs record the time
spent in CUDA activities (e.g. kernel executions or memory copies) on the CUDA
device. The GPU times are mapped to the Caliper regions that launched those
GPU activities. Here is example output for `cuda-activity-report`::

    $ CALI_CONFIG=cuda-activity-report lrun -n 4 ./tea_leaf
    Path                 Avg Host Time Max Host Time Avg GPU Time Max GPU Time       GPU %
    timestep_loop            17.052183     17.053008    12.020218    12.037409   70.490786
    total_solve              17.048899     17.049464    12.019799    12.036989   70.501909
    solve                    17.026991     17.027000    12.009868    12.027047   70.534294
    dot_product               0.004557      0.006057
    cudaMalloc                0.000050      0.000055
    internal_halo_update      0.051974      0.052401
    halo_update               0.110023      0.111015
    halo_exchange            15.154157     15.184635     0.621857     0.625657    4.103542
    cudaMemcpy               12.398854     12.422384     0.234947     0.235587    1.894913
    cudaLaunchKernel          1.426069      1.440035     0.386910     0.390767   27.131206
    cudaMemcpy                0.497959      0.503879     0.003377     0.003433    0.678115
    cudaLaunchKernel          0.772027      0.793401    11.384634    11.402917 1474.641141

For each Caliper region, we now see the time spent on the CPU ("Avg/Max Host
Time") and the aggregate time on the GPU for activities launched from
this region ("Avg/Max GPU Time"). Note how the ``cudaLaunchKernel`` call in
the last row of the output has ~0.78 seconds of CPU time associated with it,
but ~11.38 seconds of GPU time: these 11.38 seconds reflect the amount of GPU
work launched asynchronously from this region. Most CPU time in the
example output is actually spent in ``cudaMemcpy`` under ``halo_exchange``,
so we can deduce that much of it is really synchronization time spent
waiting until the GPU kernels complete.

The "GPU %" metric shows the fraction of (wall-clock) CPU time in which
GPU activities are executing, and represents the GPU utilization inside a
Caliper region. The percentage is based on the total inclusive CPU time and
GPU activities for the region and its sub-regions. The "GPU %" for the
top-level region (``timestep_loop``) represents the GPU utilization of the
entire program.

The definition of the reported metrics is as follows:

Avg Host Time
  Inclusive time (seconds) in the Caliper region on the Host (CPU).
  Typically the wall-clock time. Average value across MPI ranks.

Max Host Time
  Inclusive time (seconds) in the Caliper region on the Host (CPU).
  Typically the wall-clock time. Maximum value among all MPI ranks.

Avg GPU Time
  Inclusive total time (seconds) of activities executing on the GPU
  launched from the Caliper region. Average across MPI ranks.

Max GPU Time
  Inclusive total time (seconds) of activities executing on the GPU
  launched from the Caliper region. Maximum value among all MPI ranks.

GPU %
  Fraction of total inclusive GPU time vs. CPU time. Typically
  represents the GPU utilization in the Caliper region.

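As a sanity check, the "GPU %" value follows from the host and GPU times. A quick sketch with the ``timestep_loop`` numbers from the report above (this only illustrates the ratio, not Caliper's internal aggregation):

```python
# timestep_loop row from the cuda-activity-report example above
avg_host_time = 17.052183  # inclusive CPU time (seconds)
avg_gpu_time = 12.020218   # inclusive GPU activity time (seconds)

gpu_pct = 100.0 * avg_gpu_time / avg_host_time
print(f"GPU % = {gpu_pct:.2f}")  # prints: GPU % = 70.49
```

Because asynchronously launched GPU work can far outlive the short CPU call that launched it, region-level "GPU %" values can exceed 100%, as in the ``cudaLaunchKernel`` row of the report above.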
Show GPU kernels
---------------------------------------

Use the `show_kernels` option in `cuda-activity-report` to distinguish
individual CUDA kernels::

    $ CALI_CONFIG=cuda-activity-report,show_kernels lrun -n 4 ./tea_leaf
    Path          Kernel                                    Avg Host Time Max Host Time Avg GPU Time Max GPU Time    GPU %
    timestep_loop
      |-                                                        17.068956     17.069917     0.239392     0.240725 1.402501
      |- device_unpack_top_buffe~~le*, double*, int, int)                                   0.091051     0.092734
      |- device_tea_leaf_ppcg_so~~ const*, double const*)                                   5.409844     5.419096
      |- device_tea_leaf_ppcg_so~~t*, double const*, int)                                   5.316101     5.320777
      |- device_pack_right_buffe~~le*, double*, int, int)                                   0.112455     0.113198
      |- device_pack_top_buffer(~~le*, double*, int, int)                                   0.092634     0.092820
      (...)
      |- device_pack_bottom_buff~~le*, double*, int, int)                                   0.098929     0.099095
    summary
      |-                                                         0.000881      0.000964     0.000010     0.000011 1.179024
      |- device_field_summary_ke~~ble*, double*, double*)                                   0.000325     0.000326
      |- void reduction<double, ~~N_TYPE)0>(int, double*)                                   0.000083     0.000084
    cudaMemcpy                                                   0.000437      0.000457     0.000010     0.000011 2.376874
    cudaLaunchKernel
      |-                                                         0.000324      0.000392
      |- device_field_summary_ke~~ble*, double*, double*)                                   0.000325     0.000326
      |- void reduction<double, ~~N_TYPE)0>(int, double*)                                   0.000083     0.000084

We now see the GPU time in each kernel. The display is "inclusive": for each
Caliper region, we see the kernels launched from this region as well as from
all regions below it. Under the top-level region (``timestep_loop``), we see
the total time spent in each CUDA kernel in the program.

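Conceptually, the inclusive kernel view credits each kernel's GPU time to the launching region and all of its ancestors. A small sketch of such a rollup (region paths, kernel names, and times are hypothetical, and this is not Caliper's actual implementation):

```python
from collections import defaultdict

# Hypothetical (region path, kernel) -> GPU seconds, as flat samples.
samples = {
    (("main", "solve"), "kernel_a"): 2.0,
    (("main", "solve", "halo"), "kernel_a"): 1.0,
    (("main",), "kernel_b"): 0.5,
}

# Credit each sample to every prefix of its region path (inclusive view).
inclusive = defaultdict(float)
for (path, kernel), t in samples.items():
    for i in range(1, len(path) + 1):
        inclusive[(path[:i], kernel)] += t

print(inclusive[(("main",), "kernel_a")])  # prints: 3.0
```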
Profiling memory copies
---------------------------------------

The `cuda.memcpy` option shows the amount of data copied between CPU and GPU
memory::

    $ CALI_CONFIG=cuda-activity-report,cuda.memcpy lrun -n 4 ./tea_leaf
    Path                 Avg Host Time Max Host Time Avg GPU Time Max GPU Time     GPU % Copy CPU->GPU (avg) Copy CPU->GPU (max) Copy GPU->CPU (avg) Copy GPU->CPU (max)
    timestep_loop            17.098644     17.100283    12.020492    12.040316 70.300846
    total_solve              17.094882     17.095510    12.020072    12.039896 70.313864
    solve                    17.072250     17.072253    12.010138    12.029962 70.348891
    dot_product               0.001489      0.001684
    cudaMalloc                0.000050      0.000054
    internal_halo_update      0.052739      0.053484
    halo_update               0.110009      0.111008
    halo_exchange            15.194774     15.220273     0.622893     0.629519  4.099388
    cudaMemcpy               12.405958     12.430534     0.234917     0.235315  1.893578          902.517616          902.517616          902.469568          902.469568
    cudaLaunchKernel          1.431387      1.463419     0.387976     0.394445 27.104925
    cudaMemcpy                0.493047      0.494699     0.003369     0.003393  0.683394            0.040320            0.040320            0.019232            0.019232

In the example, there were about 902 Megabytes copied from CPU to GPU memory
and back. There are four new metrics:

Copy CPU->GPU (avg)
  Data copied (Megabytes) from CPU to GPU memory. Average across MPI ranks.

Copy CPU->GPU (max)
  Data copied (Megabytes) from CPU to GPU memory. Maximum among MPI ranks.

Copy GPU->CPU (avg)
  Data copied (Megabytes) from GPU to CPU memory. Average across MPI ranks.

Copy GPU->CPU (max)
  Data copied (Megabytes) from GPU to CPU memory. Maximum among MPI ranks.

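The copy volumes can be combined with the GPU-side times to estimate an effective transfer bandwidth. A very rough sketch using the large ``cudaMemcpy`` row under ``halo_exchange`` above (it averages over many individual copies, and the region's GPU time also covers the GPU->CPU transfers, so treat the result as a ballpark figure only):

```python
copied_mb = 902.517616    # "Copy CPU->GPU (avg)" for the cudaMemcpy region (Megabytes)
gpu_copy_time = 0.234917  # "Avg GPU Time" for the same region (seconds)

bandwidth_gb_s = copied_mb / 1000.0 / gpu_copy_time
print(f"~{bandwidth_gb_s:.1f} GB/s effective CPU->GPU bandwidth")
```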
Profiles in JSON or CALI format
---------------------------------------

The `cuda-activity-profile` config records data similar to
`cuda-activity-report` but writes it in machine-readable JSON or Caliper's
".cali" format for processing with `cali-query`. By default, it writes the
`json-split` format that can be read by Hatchet::

    $ CALI_CONFIG=cuda-activity-profile lrun -n 4 ./tea_leaf
    $ ls *.json
    cuda_profile.json

Other formats can be selected with the `output.format` option. Possible
values:

cali
  The Caliper profile/trace format for processing with `cali-query`.

hatchet
  JSON format readable by Hatchet. See :ref:`json-split-format`.

json
  Easy-to-parse list-of-dicts style JSON. See :ref:`json-format`.
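The list-of-dicts `json` format is straightforward to post-process with standard tools. A sketch of filtering such a profile in Python; the record keys below are purely illustrative, since the actual attribute names depend on the measurement configuration:

```python
import json

# Illustrative records in the list-of-dicts style of the "json" format.
# Real attribute names depend on the Caliper configuration.
profile = json.loads("""
[
  {"path": "timestep_loop",       "avg#gpu.time": 12.02, "avg#host.time": 17.05},
  {"path": "timestep_loop/solve", "avg#gpu.time": 12.01, "avg#host.time": 17.03}
]
""")

# Find the region with the largest GPU time.
top = max(profile, key=lambda rec: rec.get("avg#gpu.time", 0.0))
print(top["path"])  # prints: timestep_loop
```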
OpenMP profiling
================================

Caliper can profile OpenMP constructs with the help of the OpenMP tools
interface (OMPT). This requires a compiler with OMPT support, e.g. clang 9+.
Build Caliper with ``-DWITH_OMPT=On`` to enable it.

When OMPT support is enabled, Caliper provides the `openmp-report` built-in
config, as well as the `openmp.times` and `openmp.threads` options for configs
like `runtime-report` and `hatchet-region-profile`. With manual configurations,
you can use the :ref:`ompt-service` service.

OpenMP profiling with openmp-report
------------------------------------

The `openmp-report` config measures and prints the time spent in OpenMP
constructs on each thread::

    $ CALI_CONFIG=openmp-report ./caliper-openmp-example
    Path #Threads Time (thread) Time (total) Work %    Barrier % Time (work) Time (barrier)
    main              0.005122     0.027660  85.969388 14.030612
    work     4        0.005110     0.027572  85.969388 14.030612    0.011121       0.001815

This shows example output for a program like the one in section
:ref:`omp-instrumentation`. For Caliper regions with active OpenMP parallel
regions inside them (like "work" in the example), the report prints the
total CPU time across all threads spent doing OpenMP work ("Time (work)") and
stuck in OpenMP barriers ("Time (barrier)"). It also computes "Work %" and
"Barrier %" metrics, which indicate the relative amount of time spent in work
and barrier regions, respectively. These two metrics are *inclusive*, so we
see the overall OpenMP efficiency for the entire program in the "main" region.
The definition of the metrics is as follows:

Path
  The Caliper region hierarchy.

#Threads
  Maximum number of OpenMP threads in the region.

Time (thread)
  Time in seconds spent on the main thread (i.e., wall-clock time).

Time (total)
  Sum of CPU time in seconds across all threads.

Work %
  Percent of CPU time spent in OpenMP workshare regions vs. overall time in
  OpenMP. This is an inclusive metric (i.e., aggregated over all child
  regions).

Barrier %
  Percent of CPU time spent in barriers vs. overall time in OpenMP. This is
  also an inclusive metric.

Time (work)
  CPU time spent in OpenMP workshare regions (e.g., ``#pragma omp for`` loops).

Time (barrier)
  CPU time spent in OpenMP barriers (both implicit and explicit barriers).

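The "Work %" and "Barrier %" values for a region follow from the work and barrier times. A small sketch with the numbers from the "work" row above (illustrating the ratio, not Caliper's internal aggregation):

```python
time_work = 0.011121     # "Time (work)" for the "work" region (seconds)
time_barrier = 0.001815  # "Time (barrier)" (seconds)

total_omp = time_work + time_barrier
work_pct = 100.0 * time_work / total_omp
barrier_pct = 100.0 * time_barrier / total_omp

print(f"Work % = {work_pct:.2f}, Barrier % = {barrier_pct:.2f}")
# prints: Work % = 85.97, Barrier % = 14.03
```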
Profiling by thread
.................................

With the "show_threads" option, we can see the times spent on each thread::

    $ CALI_CONFIG=openmp-report,show_threads=1,show_regions=true ./caliper-openmp-example
    Path #Threads Time (thread) Time (total) Thread Work %     Barrier % Time (work) Time (barrier)
    main
      |-   0.000197 0.014312
      |-   0.002747 3 100.000000
      |-   0.003086 1 100.000000
      |-   0.002856 2 100.000000
      |-   0.005075 0.005075 0 62.539218 37.460782
    work
      |-   0.000180 0.014228
      |- 4 0.002747 3 100.000000 0.002688
      |- 4 0.003086 1 100.000000 0.003018
      |- 4 0.002856 2 100.000000 0.002800
      |- 4 0.005075 0.005075 0 62.539218 37.460782 0.002990 0.001791

We now see the additional "Thread" column with the OpenMP thread ID, and
"Time (total)" shows the CPU time in each OpenMP thread per Caliper region.
The rows without a "Thread" entry refer to time spent outside OpenMP parallel
regions.

Thread summary view
.................................

With "show_regions=false", we can disable the grouping by Caliper region and
just show a performance summary across OpenMP threads::

    $ CALI_CONFIG=openmp-report,show_threads=true,show_regions=false
    #Threads Time (thread) Time (total) Thread Work %    Barrier % Time (work) Time (barrier)
               0.000175     0.016727
    4          0.002760                3      100.000000              0.002722
    4          0.002958                1      100.000000              0.002901
    4          0.002768                2      100.000000              0.002699
    4          0.004913     0.004913   0       62.682091 37.317909    0.002926       0.00174

OpenMP options for other configs
--------------------------------

Caliper also provides the OpenMP options for other profiling configs such as
`hatchet-region-profile`. These are:

openmp.times
  Measure and return CPU time spent in OpenMP workshare and barrier regions.

openmp.efficiency
  Compute the "Work %" and "Barrier %" metrics as in the `openmp-report`
  config shown above.

openmp.threads
  Group by thread (i.e., record metrics for each OpenMP thread separately),
  as with the "show_threads" option for `openmp-report` shown above.

.. _omp-instrumentation:

Instrumentation
--------------------------------

When instrumenting the code, make sure to use process-scope annotations
to mark code regions outside of OpenMP parallel regions, *or* put thread-scope
annotations onto every OpenMP thread. Mixing both approaches is possible but
not recommended, since it produces separate region hierarchies for the process
and thread scopes. See :ref:`notes_on_threading` for more information.

Use the ``CALI_CALIPER_ATTRIBUTE_DEFAULT_SCOPE`` config flag to define
whether Caliper regions should use process or thread scope. They use thread
scope by default. The following example sets the attribute default scope to
"process" so that the "main" and "work" Caliper regions are visible on all
OpenMP threads inside the parallel region:

.. code-block:: c++

    #include <caliper/cali.h>

    int main()
    {
        cali_config_set("CALI_CALIPER_ATTRIBUTE_DEFAULT_SCOPE", "process");

        CALI_MARK_BEGIN("main");
        CALI_MARK_BEGIN("work");

        #pragma omp parallel
        {
            // ...
        }

        CALI_MARK_END("work");
        CALI_MARK_END("main");
    }

Enabling OMPT
-------------------------------

Caliper enables the OpenMP tools interface automatically when the `ompt`
service is active. In some cases this can fail, typically when the OpenMP
runtime is initialized before Caliper. In that case, set the
:envvar:`CALI_USE_OMPT` environment variable to "1" or "true" to enable
OpenMP support manually.