Timings
All results on H100 (Musica), 3 runs averaged, single GPU.
Software: NGSolve 6.2.2604-9-gf15d395df, CUDA 12.9.
Hardware: NVIDIA H100 (GPU) | AMD EPYC Zen4, dual-socket, 384 logical threads.
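The benchmark harness itself is not shown here. As a rough illustration of what "3 runs averaged" could look like, the sketch below times a callable with a wall-clock timer and returns the mean in milliseconds. The names `time_ms`, `solve`, and `sync` are hypothetical. `sync` stands for whatever device-synchronization call the GPU backend provides, which asynchronous GPU work would need before the clock is read.

```python
import time
from statistics import mean


def time_ms(solve, runs=3, sync=None):
    """Mean wall-clock time of solve() in milliseconds over `runs` calls.

    `solve` and `sync` are hypothetical callables, not NGSolve API.
    """
    samples = []
    for _ in range(runs):
        if sync is not None:
            sync()  # drain pending device work before starting the clock
        start = time.perf_counter()
        solve()
        if sync is not None:
            sync()  # wait for asynchronous device work before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return mean(samples)


# Stand-in CPU workload; a real measurement would wrap one solver call.
elapsed = time_ms(lambda: sum(i * i for i in range(100_000)))
print(f"{elapsed:.2f} ms")
```

Whether the reported numbers include warm-up runs or graph-capture cost is not stated in this document, so the sketch makes no attempt to model either.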
Part 1: DevCGSolver, Poisson (symmetric positive definite)
Problem: unit square, H1 order 2, Jacobi smoother preconditioner (CreateSmoother) — matches tutorial 5.5.1
Baselines: CPU CGSolver, C++ CGSolver with device matrices, DevCGSolver no-graph, DevCGSolver Conditional While Graph
| ndof | CPU all-T (ms) | CPU 1T (ms) | C++ dev (ms) | No-graph (ms) | Conditional While Graph (ms) | vs 1T | vs C++ dev | vs No-graph |
|---|---|---|---|---|---|---|---|---|
| 1,961 | 98.5 | 1.9 | 6.8 | 5.7 | 3.5 | 0.5× | 1.95× | 1.64× |
| 5,277 | 346.6 | 8.5 | 10.8 | 9.1 | 5.0 | 1.7× | 2.14× | 1.81× |
| 11,825 | 308.5 | 29.0 | 16.0 | 13.7 | 7.5 | 3.9× | 2.12× | 1.81× |
| 46,741 | 1,351.5 | 235.9 | 35.2 | 29.2 | 16.8 | 14.0× | 2.09× | 1.74× |
| 95,225 | 2,467.0 | 695.0 | 51.6 | 44.6 | 27.1 | 25.7× | 1.91× | 1.65× |
| 185,809 | 4,607.0 | 1,973.7 | 81.4 | 74.9 | 49.5 | 39.9× | 1.64× | 1.51× |
| 514,637 | 28,133.6 | 9,783.5 | 198.0 | 177.9 | 134.6 | 72.7× | 1.47× | 1.32× |
CPU all-T: NGSolve TaskManager with all 384 logical threads on the node (dual-socket AMD EPYC Zen4); this ignores the SLURM setting --cpus-per-task=22.
CPU 1T: no TaskManager, single-threaded reference that gives an unambiguous lower bound for CPU performance.
GPU column times (C++ dev, no-graph, Conditional While Graph) are stable and node-independent.
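The three "vs" columns in the Part 1 table appear to be ratios of a reference time to the Conditional While Graph time. The sketch below recomputes them from the tabulated milliseconds as a sanity check. The helper name `speedups` is hypothetical. Because the tabulated times are rounded to 0.1 ms, the recomputed ratios can differ from the printed ones in the last digit (for example 10.8 / 5.0 = 2.16 against a printed 2.14×).

```python
# Times in ms copied from the Part 1 table:
# (ndof, CPU 1T, C++ dev, no-graph, Conditional While Graph)
rows = [
    (1961, 1.9, 6.8, 5.7, 3.5),
    (5277, 8.5, 10.8, 9.1, 5.0),
    (11825, 29.0, 16.0, 13.7, 7.5),
    (46741, 235.9, 35.2, 29.2, 16.8),
    (95225, 695.0, 51.6, 44.6, 27.1),
    (185809, 1973.7, 81.4, 74.9, 49.5),
    (514637, 9783.5, 198.0, 177.9, 134.6),
]


def speedups(row):
    """Return (vs 1T, vs C++ dev, vs no-graph) for one table row."""
    _ndof, cpu_1t, cpp_dev, no_graph, graph = row
    # Each speedup is reference time divided by Conditional While Graph time.
    return cpu_1t / graph, cpp_dev / graph, no_graph / graph


for row in rows:
    vs_1t, vs_cpp, vs_nograph = speedups(row)
    print(f"{row[0]:>8,d}  vs 1T {vs_1t:5.1f}x  "
          f"vs C++ dev {vs_cpp:4.2f}x  vs no-graph {vs_nograph:4.2f}x")
```

A value below 1 in the first column, as in the smallest case, means the single-threaded CPU solve was faster than the graph-based GPU solve at that size.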
Part 2: DevTFQMRSolver, 3D convection (non-symmetric)
Problem: unit cube, DG L2 order 2, convection-diffusion, block smoother preconditioner
Comparison: Python TFQMR with device matrices, DevTFQMRSolver no-graph, DevTFQMRSolver Conditional While Graph
| ndof | Python TFQMR (ms) | No-graph (ms) | Conditional While Graph (ms) | vs Python dev | vs No-graph |
|---|---|---|---|---|---|
| 2,060 | 3.6 | 1.8 | 1.5 | 2.4× | 1.26× |
| 6,520 | 4.6 | 2.4 | 1.7 | 2.7× | 1.39× |
| 20,250 | 6.3 | 3.1 | 2.3 | 2.7× | 1.38× |
| 47,810 | 8.5 | 4.7 | 3.2 | 2.7× | 1.50× |
| 149,650 | 15.5 | 9.9 | 8.0 | 1.9× | 1.23× |
| 340,370 | 29.1 | 21.3 | 19.1 | 1.5× | 1.12× |