Docs/sphinx_documentation/source/GPU.rst

.. _sec:gpu:overview:

Overview of AMReX GPU Support
=============================

AMReX's GPU support focuses on providing performance portability
with minimal code changes required at the application level. This allows
application teams to use a single, maintainable codebase that works
on a variety of platforms while allowing for the performance tuning of specific,
high-impact kernels if desired.

Internally, AMReX uses the native programming languages for GPUs: CUDA for NVIDIA, HIP
for AMD, and SYCL for Intel. These are designated ``CUDA/HIP/SYCL``
throughout the documentation. However, application teams can also use
OpenACC or OpenMP in their individual codes.

At this time, AMReX does not support cross-native language compilation
(HIP for non-AMD systems and SYCL for non-Intel systems). It may work with
a given version, but AMReX does not track or guarantee such functionality.

AMReX uses an ``MPI+X`` approach to hierarchical parallelism. When running on
CPUs, ``X`` is ``OpenMP``, and threads are used to process tiles assigned to the
same MPI rank concurrently, as detailed in :ref:`sec:basics:mfiter:tiling`. On GPUs,
``X`` is one of ``CUDA/HIP/SYCL``, and tiling is disabled by default
to mitigate the overhead associated with kernel launching. Instead, one or more cells
in a given ``Box`` are mapped to each GPU thread, as illustrated in :numref:`fig:gpu:threads`
below.
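
In code, this pattern is typically expressed as an :cpp:`MFIter` loop over boxes with a
:cpp:`ParallelFor` over the cells of each box. A minimal sketch is shown below; the
:cpp:`MultiFab` ``mf`` and the update itself (doubling each cell) are placeholders:

.. highlight:: c++

::

    #ifdef AMREX_USE_OMP
    #pragma omp parallel if (amrex::Gpu::notInLaunchRegion())
    #endif
    for (amrex::MFIter mfi(mf, amrex::TilingIfNotGPU()); mfi.isValid(); ++mfi)
    {
        // On CPUs this is a tile box; on GPUs tiling is disabled and
        // this is the full valid box.
        const amrex::Box& bx = mfi.tilebox();
        amrex::Array4<amrex::Real> const& a = mf.array(mfi);
        // On GPUs, ParallelFor launches a kernel in which each thread
        // handles one (i,j,k) cell; on CPUs it lowers to a nested loop.
        amrex::ParallelFor(bx,
        [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            a(i,j,k) *= 2.0;
        });
    }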

Presented here is an overview of important features of AMReX's GPU support.
Additional information that is required for creating GPU applications is
detailed throughout the rest of this chapter:

- Each MPI rank offloads its work to a single GPU. Multiple ranks can share the
  same device, but for best performance we usually recommend ``(MPI ranks == Number of GPUs)``.

- GPU kernels are launched through ``ParallelFor`` looping constructs that use GPU extended
  lambdas, providing performance portability (as sketched above). When compiled with GPU support,
  these constructs launch kernels with a large number of GPU threads that each work on only a
  few cells. This work distribution is illustrated in :numref:`fig:gpu:threads`.

.. |a| image:: ./GPU/gpu_2.png

[Figure ``fig:gpu:threads``: a box tiled for CPU work, with the lo and hi of one
tiled box marked, compared with a GPU launch using one cell per thread, each
thread working on a box with lo = hi.]

- These kernels are usually launched inside AMReX's :cpp:`MFIter` and :cpp:`ParIter`
  loops, since in AMReX's approach to parallelism it is assumed that separate ``Box`` objects
  can be processed independently. However, AMReX also provides a ``MultiFab`` version
  of ``ParallelFor`` that can process an entire level's worth of ``Box`` objects in
  a single kernel launch when it is safe to do so (see the sketch after this list).

- AMReX can utilize GPU managed memory to automatically handle memory
  movement for mesh and particle data. Simple data structures, such
  as :cpp:`IntVect`\s, can be passed by value, and complex data structures, such as
  :cpp:`FArrayBox`\es, have specialized AMReX classes to handle the
  data movement for the user. This is particularly useful for the early stages
  of porting an application to GPUs. However, for best performance on a
  variety of platforms, we recommend disabling managed memory and handling
  host/device data migration explicitly. Managed memory is not used by
  :cpp:`FArrayBox` and :cpp:`MultiFab` by default.

- Best performance is usually achieved when keeping mesh and particle data structures
  on the GPU for as long as possible, minimizing movement back to the CPU.
  In many AMReX applications, the mesh and particle data can stay on the GPU for most
  subroutines except for I/O operations.

- AMReX further parallelizes GPU applications by utilizing streams.
  Streams guarantee execution order of kernels within the same stream, while
  kernels in different streams are allowed to execute concurrently.
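
The ``MultiFab`` version of ``ParallelFor`` mentioned above fuses the per-box kernels
into a single launch. A minimal sketch, again with a placeholder :cpp:`MultiFab` ``mf``
and an arbitrary update:

.. highlight:: c++

::

    // One launch covers every box of mf owned by this MPI rank.
    auto const& arrs = mf.arrays();
    amrex::ParallelFor(mf,
    [=] AMREX_GPU_DEVICE (int box_no, int i, int j, int k) noexcept
    {
        arrs[box_no](i,j,k) *= 2.0;
    });
    // Synchronize before the results are accessed on the host.
    amrex::Gpu::streamSynchronize();
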
GPU Safe Classes and Functions
==============================

AMReX GPU work typically takes place inside MFIter and ParIter loops.
Therefore, there are two ways classes and functions have been modified
to interact with the GPU:

1. Functions have been marked so that they can be called on the device,
such as :cpp:`amrex::min` and :cpp:`amrex::max`. In specialized cases,
classes are labeled such that the object can be constructed, destructed
and its functions can be implemented on the device, including ``IntVect``.

2. Functions that contain MFIter or ParIter loops have been rewritten
to contain device launches. For example, the :cpp:`FillBoundary`
function cannot be called from device code, but calling it from
CPU will launch GPU kernels if AMReX is compiled with GPU support.
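
A brief sketch of both categories, where the :cpp:`Box` ``bx``, the :cpp:`Array4`\s
``a`` and ``b``, the :cpp:`MultiFab` ``mf`` and the :cpp:`Geometry` ``geom`` are
placeholders:

.. highlight:: c++

::

    // Category 1: device-callable functions and classes. The IntVect is
    // captured by value, and amrex::min is callable inside the kernel.
    amrex::IntVect iv(AMREX_D_DECL(1,0,0));
    amrex::ParallelFor(bx,
    [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
    {
        a(i,j,k) = amrex::min(a(i,j,k), b(i+iv[0],j,k));
    });

    // Category 2: called from the CPU; launches GPU kernels internally.
    mf.FillBoundary(geom.periodicity());
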
Particle Support
================

.. _sec:gpu:particle:

As with ``MultiFab``, particle data stored in AMReX ``ParticleContainer`` classes can be
stored in GPU-accessible memory when AMReX is compiled with ``USE_CUDA=TRUE``, ``USE_HIP=TRUE``,
or ``USE_SYCL=TRUE``. The type of memory used by a given ``ParticleContainer`` can be controlled
by the ``Allocator`` template parameter. By default, when compiled with GPU support,
``ParticleContainer`` uses ``The_Arena()``. This means that the :cpp:`dataPtr` associated with
particle data can be passed into GPU kernels. These kernels can be launched with a variety of
approaches, including AMReX's native kernel launching mechanisms as well as OpenMP and OpenACC.
Using AMReX's C++ syntax, a kernel launch involving particle data might look like:

.. highlight:: c++

::

    for (MyParIter pti(pc, lev); pti.isValid(); ++pti)
    {
        auto& ptile = pti.GetParticleTile();
        auto ptd = ptile.getParticleTileData();
        const auto np = ptile.numParticles();
        // Invalidate every particle in this tile, one particle per thread.
        amrex::ParallelFor(np,
        [=] AMREX_GPU_DEVICE (const int ip) noexcept
        {
            ptd[ip].make_invalid();
        });
    }

The above code simply invalidates all particles on all particle tiles. The ``ParticleTileData``
object is analogous to ``Array4`` in that it stores pointers to particle data and can be used
on either the host or the device. This is a convenient way to pass particle data into GPU kernels
because the same object can be used regardless of whether the data layout is AoS or SoA.
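
As noted above, the ``Allocator`` template parameter controls the kind of memory a
``ParticleContainer`` uses. For example, a container that keeps its particle data in
pinned host memory rather than the default device arena might be declared as follows;
the component counts ``<0, 0, 4, 2>`` are arbitrary placeholders:

.. highlight:: c++

::

    // Particle data allocated with the pinned-memory arena.
    using PinnedPC = amrex::ParticleContainer<0, 0, 4, 2,
                                              amrex::PinnedArenaAllocator>;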

An example Fortran particle subroutine offloaded via OpenACC might look like the following:

.. highlight:: fortran
