GPU Computing

Exasim supports GPU execution through installed CUDA and HIP package variants and generated Kokkos-style model kernels. GPU execution is most effective when the simulation performs substantial element-local work per byte transferred.
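
As a rough illustration of "work per byte transferred", the sketch below computes arithmetic intensity (FLOPs per byte) for a kernel and compares it with a device's machine balance, in the style of a roofline estimate. The function names and every numeric value are hypothetical placeholders, not measurements from Exasim or from any particular GPU.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def is_compute_bound(flops: float, bytes_moved: float,
                     peak_flops: float, peak_bandwidth: float) -> bool:
    """Roofline-style check: compare kernel intensity with machine balance."""
    machine_balance = peak_flops / peak_bandwidth  # FLOPs per byte
    return arithmetic_intensity(flops, bytes_moved) >= machine_balance


# Hypothetical device: 10 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK_FLOPS = 10.0e12
PEAK_BANDWIDTH = 1.0e12

# Hypothetical kernels: a dense element-local kernel and a simple vector update.
dense_kernel = is_compute_bound(4.0e9, 1.0e8, PEAK_FLOPS, PEAK_BANDWIDTH)
vector_update = is_compute_bound(2.0e6, 2.4e7, PEAK_FLOPS, PEAK_BANDWIDTH)
print(dense_kernel, vector_update)
```

Under these assumed numbers the dense kernel sits above the machine balance and the vector update sits well below it, which is the distinction the paragraph above is drawing.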

Execution Model

The backend allocates runtime arrays through backend-aware allocation helpers and keeps large solution, residual, mesh, and temporary arrays resident on the active backend when possible. Generated model functions provide Kokkos-prefixed callbacks such as flux, source, boundary, output, QoI, and visualization kernels.

flowchart LR
  HOST["host setup<br/>datain/*.bin"] --> COPY["copy to backend memory"]
  COPY --> KERNELS["element / face / model kernels"]
  KERNELS --> BLAS["cuBLAS / hipBLAS / backend BLAS"]
  BLAS --> SOLVE["Newton-GMRES"]
  SOLVE --> OUTPUT["copy/write selected outputs"]

CUDA And HIP

The installed Exasim package exports CPU and GPU library variants. CUDA builds target NVIDIA GPUs; HIP builds target AMD GPUs. The selected executable links to the matching library target and uses the compiler/runtime for that backend.

What Should Stay On The GPU

For performance, the following should remain backend-resident during a solve:

solution arrays (udg, wdg, uh, and related work arrays);
residual and Krylov vectors;
element and face geometry data;
local matrix blocks for HDG;
preconditioner storage;
generated model-kernel input/output arrays.

Host-device transfers should occur at setup, output, restart/postprocessing, or explicit parameter updates, not inside every quadrature or element loop.
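
The transfer pattern described above can be sketched with a mock backend that only counts copies. `TransferCounter` and `run_solve` are hypothetical names used for illustration; they do not correspond to Exasim's allocation helpers, and the "kernel" is a placeholder arithmetic update.

```python
class TransferCounter:
    """Stand-in for a backend that counts host<->device copies (hypothetical)."""

    def __init__(self):
        self.to_device = 0
        self.to_host = 0

    def upload(self, host_data):
        self.to_device += 1
        return list(host_data)  # pretend this copy lives in device memory

    def download(self, device_data):
        self.to_host += 1
        return list(device_data)


def run_solve(backend, host_solution, num_steps, output_every):
    """Upload once, iterate on resident data, download only when writing output."""
    device_solution = backend.upload(host_solution)  # setup transfer
    outputs = []
    for step in range(1, num_steps + 1):
        # Placeholder for element/face kernels acting on device-resident data.
        device_solution = [0.5 * value for value in device_solution]
        if step % output_every == 0:
            outputs.append(backend.download(device_solution))  # output transfer
    return outputs


backend = TransferCounter()
snapshots = run_solve(backend, [1.0, 2.0, 4.0], num_steps=10, output_every=5)
print(backend.to_device, backend.to_host)
```

In this sketch ten steps cost one upload and two downloads. Moving the `upload` or `download` call inside the step loop would scale the transfer count with the number of steps, which is the pattern to avoid.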

LDG On GPUs

LDG matrix-free GMRES repeatedly evaluates residuals and matrix-vector products. This is GPU-friendly when residual kernels are sufficiently arithmetic-heavy, but performance can suffer if residual differencing introduces many small kernel launches or synchronization points.
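
For readers unfamiliar with residual differencing, the sketch below shows the standard finite-difference approximation of a Jacobian-vector product, J(u) v ≈ (R(u + h v) − R(u)) / h, which is what a matrix-free Krylov method needs in place of an assembled matrix. The residual function, the step-size rule, and the value of `eps` are assumptions chosen for illustration; the source does not specify how Exasim selects them.

```python
import math


def residual(u):
    """Hypothetical nonlinear residual R_i(u) = u_i^2 - 1 (stand-in for a DG residual)."""
    return [x * x - 1.0 for x in u]


def jacobian_vector_product(residual_fn, u, v, eps=1.0e-7):
    """Approximate J(u) v by a forward difference of the residual."""
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_v == 0.0:
        return [0.0 for _ in v]
    norm_u = math.sqrt(sum(x * x for x in u))
    h = eps * (1.0 + norm_u) / norm_v  # assumed step-size scaling
    r0 = residual_fn(u)  # in a real solver this is reused across Krylov iterations
    r1 = residual_fn([ui + h * vi for ui, vi in zip(u, v)])
    return [(a - b) / h for a, b in zip(r1, r0)]


u = [1.0, 2.0, 3.0]
v = [1.0, 0.0, -1.0]
jv = jacobian_vector_product(residual, u, v)
exact = [2.0 * ui * vi for ui, vi in zip(u, v)]  # analytic J v for this residual
max_error = max(abs(a - b) for a, b in zip(jv, exact))
```

Each product costs one extra residual evaluation, so the number and size of the kernels launched inside the residual function largely determine how well this approach maps to a GPU.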

HDG On GPUs

HDG performs more local matrix work and trace-system operations. It can achieve high arithmetic intensity in local element kernels and dense linear algebra, but memory use for element matrices and preconditioner blocks must be monitored.
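
A back-of-envelope estimate can show why element-matrix memory deserves attention. The sketch below assumes one dense double-precision block per element whose dimension equals the number of trace unknowns touching that element (components × nodes per face × faces per element). This storage model and the case sizes are assumptions for illustration only; they are not Exasim's actual data layout.

```python
def trace_block_dim(num_components, nodes_per_face, faces_per_element):
    """Trace unknowns coupled to one element (assumed block dimension)."""
    return num_components * nodes_per_face * faces_per_element


def dense_block_bytes(num_elements, block_dim, bytes_per_value=8):
    """Bytes needed to store one dense block_dim x block_dim matrix per element."""
    return num_elements * block_dim * block_dim * bytes_per_value


# Hypothetical case: 10,000 hexahedra, 5 solution components,
# 16 nodes per face (polynomial order 3), 6 faces per element.
block_dim = trace_block_dim(num_components=5, nodes_per_face=16, faces_per_element=6)
total_bytes = dense_block_bytes(num_elements=10_000, block_dim=block_dim)
print(block_dim, total_bytes / 1.0e9, "GB")
```

Because the estimate grows with the square of the block dimension, raising the polynomial order or the number of solution components increases memory much faster than it increases the number of unknowns.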

Performance Guidance

Increase polynomial order only when the extra arithmetic improves accuracy or amortizes memory movement.
Avoid frequent output in GPU runs; each write requires the data to be copied back to host-visible memory first.
Compare CPU and GPU residual histories for small cases before trusting a new GPU model.
For parameter sweeps, ensure each case updates both host-side and device-side physics parameters.
Use backend-specific profilers such as Nsight Systems/Compute or rocprof when investigating performance.
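
The CPU/GPU residual comparison suggested above can be automated with a small check like the following sketch. The function name and the tolerance are hypothetical choices; bitwise agreement should not be expected, because parallel reductions may sum floating-point values in a different order on each backend.

```python
def residual_histories_match(cpu_history, gpu_history, rel_tol=1.0e-8):
    """Return True if two residual-norm histories agree to a relative tolerance."""
    if len(cpu_history) != len(gpu_history):
        return False  # a different iteration count already signals a discrepancy
    for cpu_value, gpu_value in zip(cpu_history, gpu_history):
        scale = max(abs(cpu_value), abs(gpu_value))
        if scale > 0.0 and abs(cpu_value - gpu_value) > rel_tol * scale:
            return False
    return True


# Hypothetical residual norms per nonlinear iteration.
cpu = [1.0, 1.0e-2, 1.0e-5, 1.0e-9]
gpu_close = [value * (1.0 + 1.0e-12) for value in cpu]  # round-off level difference
gpu_wrong = [1.0, 1.0e-2, 3.0e-5, 1.0e-9]               # diverges at iteration 3

print(residual_histories_match(cpu, gpu_close))
print(residual_histories_match(cpu, gpu_wrong))
```

A relative tolerance is used so that the check stays meaningful as the residual norm drops by many orders of magnitude during convergence.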

Correctness Guidance

GPU bugs often appear as stale data or a backend mismatch rather than as arithmetic errors. Check that:

all case-dependent parameters are copied to device memory;
initialization callbacks run after parameter updates when required;
output buffers are copied back before file writes;
the CUDA/HIP executable is linked against the matching Exasim package variant;
MPI rank-to-GPU mapping is correct on multi-GPU systems.
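
For the last item, a common convention is to assign devices round-robin by node-local rank. The sketch below shows that convention and flags oversubscription; it is an assumption about a typical setup, not a description of how Exasim selects devices. On a real system the node-local rank usually comes from the launcher, for example the `OMPI_COMM_WORLD_LOCAL_RANK` (Open MPI) or `SLURM_LOCALID` (Slurm) environment variables.

```python
def device_for_rank(local_rank, devices_per_node):
    """Round-robin device index for a node-local MPI rank."""
    if devices_per_node <= 0:
        raise ValueError("devices_per_node must be positive")
    if local_rank < 0:
        raise ValueError("local_rank must be non-negative")
    return local_rank % devices_per_node


def check_mapping(ranks_per_node, devices_per_node):
    """Return each local rank's device and whether any device is shared."""
    assignment = [device_for_rank(rank, devices_per_node)
                  for rank in range(ranks_per_node)]
    oversubscribed = ranks_per_node > devices_per_node
    return assignment, oversubscribed


# Hypothetical node with 4 GPUs.
one_rank_per_gpu = check_mapping(ranks_per_node=4, devices_per_node=4)
shared_gpus = check_mapping(ranks_per_node=6, devices_per_node=4)
print(one_rank_per_gpu)
print(shared_gpus)
```

Logging the resulting assignment at startup makes it easy to spot the two usual failure modes: every rank landing on device 0, or more ranks per node than devices.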