# Parallel ZPIC

[ZPIC](https://github.com/ricardo-fonseca/zpic) is a sequential 2D EM-PIC kinetic plasma simulator based on OSIRIS [1], implementing the same core algorithm and features. From the ZPIC code (em2d variant), we developed several parallel versions to explore tasking ([OmpSs-2](https://pm.bsc.es/ompss-2)) and emerging platforms (GPUs with [OpenACC](https://www.openacc.org/)).

## Parallelization Strategy and Features

### General Strategy

In all parallel versions, the simulation space is split into multiple regions along the y axis (i.e., a row-wise decomposition). Each region stores both the particles inside it and the fraction of the grid they interact with, allowing both the particle advance and the field integration to be performed locally. However, particles can exit their associated regions and must be transferred to their new location. Each region must also be padded with ghost cells so that the thread processing the region can access grid quantities (current and EM fields) outside its boundaries. With this decomposition, the ZPIC algorithm becomes:

```
function particle_advance(region)
    for each particle in region.particles do
        old_pos = particle.pos
        particle_push(particle)
        deposit_current(region.J, old_pos, particle.pos)
        check_exiting_part(particle, region.outgoing_part)
    endfor
    if moving_window is enabled then
        shift_part_left(region.particles)
        inject_new_particles(region)
    endif
endfunction

for each time_step do
    for each region in simulation parallel do
        current_zero(region.J)
        particle_advance(region)
        add_incoming_part(region.particles, region.incoming_part)
        update_gc_add(region.J, region.neighbours.J)
        if region.J_filter is enabled then
            ... (current filtering, field integration and the remaining steps are omitted here)
endfor
```
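
To make the region layout concrete, here is a minimal C sketch of what a region in this row-wise decomposition might carry: its slice of the grid padded with ghost cells, the particles it currently owns, and buffers for particles leaving and entering it. The struct and field names below are illustrative assumptions, not the actual ZPIC data structures.

```c
// Illustrative sketch only; the real ZPIC structs differ.
typedef struct {
    int   ix, iy;         // cell index
    float x, y;           // position inside the cell
    float ux, uy, uz;     // generalized velocity
} t_part;

typedef struct {
    int nx[2];            // interior cells owned by this region (x, y)
    int gc[2][2];         // ghost-cell padding on each side of the boundaries
    float *J;             // local slice of the current density, including ghost cells
    float *E, *B;         // local slices of the EM fields, including ghost cells
    t_part *particles;    // particles currently inside the region
    int np;               // number of particles
    t_part *outgoing;     // particles that left through the lower/upper y boundary
    t_part *incoming;     // particles received from the neighbouring regions
    int n_out, n_in;
} t_region;
```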

For more details, please check our upcoming EuroPar 2021 paper. The pre-print version is available on [ArXiv](https://arxiv.org/abs/2106.12485). The `ompss2` version in this repository corresponds to the `zpic-reduction-async` variant in the paper.

### NVIDIA GPUs (OpenACC)
- Spatial decomposition (see General Strategy)
- Particles are stored as a Structure of Arrays (SoA) for accessing the global memory in a coalesced fashion
- Data management is handled by NVIDIA Unified Memory
- Each region is organized into tiles (16x16 cells) and the particles within a region are sorted by the tile they are located in. Every time step, the program executes a modified bucket sort (adapted from [2, 3]) to preserve data locality (see the sketch after this list)
- During the particle advance, each tile is mapped to an SM. Each SM loads the local EM fields into Shared Memory, advances the particles within the tile, atomically deposits the generated current in a local buffer, and then updates the global electric current buffer with the local values
- Support for multi-GPU systems
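
As a rough sketch of the SoA layout and the 16x16-cell tile index used as the bucket-sort key (names and exact layout are assumptions for illustration, not the repository's code):

```c
#define TILE_SIZE 16   // 16x16-cell tiles, as described above

// Hypothetical SoA particle buffer: one array per attribute for coalesced access.
typedef struct {
    int   *ix, *iy;      // cell index of each particle
    float *x, *y;        // position inside the cell
    float *ux, *uy, *uz; // generalized velocity
    int np;              // number of particles in this region
} t_part_vector;

// Tile a particle belongs to; the modified bucket sort groups particles by this key.
static inline int tile_index(int ix, int iy, int tiles_per_row)
{
    return (iy / TILE_SIZE) * tiles_per_row + (ix / TILE_SIZE);
}
```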

### Features:
#### OmpSs-2:
- Simulation stages defined as OmpSs-2 tasks
- Tasks are synchronized through data dependencies (see the sketch after this list)
- Fully asynchronous execution
- Dynamic load balancing (overdecomposition + dynamic task scheduling)
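
A hedged sketch of this tasking style, using the general OmpSs-2 `#pragma oss task` syntax with dependency clauses; the helper functions and buffer names are made up for illustration:

```c
// Hypothetical helpers (prototypes only, bodies omitted).
void particle_advance_kernel(float *J, long J_size);
void add_ghost_cells(float *J, const float *below, const float *above, long gc_size);

// Minimal sketch (not the repository's actual code): two stages of one region
// expressed as OmpSs-2 tasks, ordered purely by data dependencies on the
// region's current buffer J and on the neighbours' buffers.
void region_step(float *J, long J_size,
                 const float *J_below, const float *J_above, long gc_size)
{
    // Advance particles and deposit current into this region's buffer.
    #pragma oss task inout(J[0;J_size]) label("particle_advance")
    particle_advance_kernel(J, J_size);

    // Reduce the neighbours' ghost-cell contributions; the dependency on
    // J_below/J_above makes this task wait for the neighbours' deposits.
    #pragma oss task inout(J[0;J_size]) in(J_below[0;gc_size], J_above[0;gc_size]) label("update_gc_add")
    add_ghost_cells(J, J_below, J_above, gc_size);
}
```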

#### OpenACC:
- Uses OpenMP for launching kernels on multiple devices, synchronizing their execution, etc.
- Prefetch routines to move data between devices and avoid page faults (see the sketch after this list)
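
A minimal sketch of this management pattern, assuming CUDA Unified Memory, one OpenMP thread per GPU, and a `cudaMemPrefetchAsync` hint before each kernel; the function names and loop body are placeholders, not the repository's code:

```c
#include <omp.h>
#include <openacc.h>
#include <cuda_runtime.h>

// Sketch: one OpenMP thread per GPU launches the kernels for its share of the
// regions; data is prefetched to the device first to avoid Unified Memory page faults.
void advance_all_regions(float **region_data, const size_t *region_len, int n_regions)
{
    int n_gpus = acc_get_num_devices(acc_device_nvidia);

    #pragma omp parallel num_threads(n_gpus)
    {
        int gpu = omp_get_thread_num();
        acc_set_device_num(gpu, acc_device_nvidia);

        #pragma omp for schedule(static)
        for (int r = 0; r < n_regions; r++) {
            float *data = region_data[r];
            size_t n = region_len[r];

            // Hint the Unified Memory system to migrate this region's data now.
            cudaMemPrefetchAsync(data, n * sizeof(float), gpu, 0);

            #pragma acc parallel loop
            for (size_t i = 0; i < n; i++)
                data[i] *= 1.0f;   // placeholder for the real particle/field kernel
        }
    }
}
```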

#### OmpSs-2 + OpenACC:
- Uses OmpSs-2 as the management layer
- OpenACC kernels incorporated as OmpSs-2 tasks (see the sketch after this list)
- Asynchronous queues/streams for kernel overlapping
- Fully asynchronous execution
- (Deprecated) Hybrid execution (CPU + GPU)
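
A heavily hedged sketch of the combination: an OpenACC kernel wrapped in an OmpSs-2 task. The `device(openacc)` clause reflects the experimental Nanos6 OpenACC support referenced in the requirements below; the exact syntax and the names here are assumptions for illustration only:

```c
// Sketch: the runtime can assign this task to a GPU and an asynchronous queue,
// letting independent kernels overlap; not the repository's actual code.
void current_zero_task(float *J, long size)
{
    #pragma oss task device(openacc) inout(J[0;size]) label("current_zero")
    {
        #pragma acc parallel loop
        for (long i = 0; i < size; i++)
            J[i] = 0.0f;
    }
}
```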

## Plasma Experiments / Input
Please check the [ZPIC documentation](https://github.com/ricardo-fonseca/zpic/blob/master/doc/Documentation.md) for more information on setting up the simulation parameters. All versions include two example simulations: LWFA (Laser Wakefield Acceleration) and the Weibel instability.

For organization purposes, each file is named after its simulation parameters according to the following scheme:
```
<experiment type> - <number of time steps> - <number of particles per species> - <grid size x> - <grid size y>
```
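
For example, under this scheme a file for a Weibel run with 500 time steps, 500 particles per species, and a 128x128 grid would be named something like the following (illustrative only; check the repository for the exact file names):

```
weibel-500-500-128-128
```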

## Output
...

OpenACC:
- PGI Compiler 19.10 or newer (later renamed as NVIDIA HPC SDK)
- CUDA v9.0 or newer
- Pascal or newer GPUs
- With OmpSs-2, use this experimental version of the [Nanos6 Runtime](https://github.com/epeec/nanos6-openacc) (get-queue-affinity branch)


### Compilation Flags

```
-DTEST                                 Prints the simulation timing and other information in a CSV-friendly format. Disables all reporting and other terminal outputs.
-DENABLE_ADVISE                        Enables CUDA MemAdvise routines to guide the Unified Memory System (any OpenACC version).
-DENABLE_PREFETCH (or make prefetch)   Enables CUDA MemPrefetch routines (experimental).
-DENABLE_AFFINITY (or make affinity)   Enables device affinity, scheduling tasks based on data location; otherwise, the Nanos6 runtime only uses 1 GPU (OmpSs@OpenACC version only).
```

### Commands
```
make <option> -j8
```
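
For instance, using the targets mentioned in the compilation flags above (a usage sketch; the available targets depend on each version's Makefile):

```
make prefetch -j8
make affinity -j8
```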

[3] F. Hariri et al., ‘A portable platform for accelerated PIC codes and its application to GPUs using OpenACC’, Computer Physics Communications, vol. 207, pp. 69–82, Oct. 2016, doi: 10.1016/j.cpc.2016.05.008.

