Timings

All results were measured on a single NVIDIA H100 GPU on Musica; each value is the average of 3 runs. Software: NGSolve 6.2.2604-9-gf15d395df, CUDA 12.9.
Hardware: NVIDIA H100 (GPU) | AMD EPYC Zen4 (CPU), dual-socket, 384 logical threads.
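
The averaging over 3 runs can be sketched as a small harness. This is a minimal illustration, not the script used for the tables: the name mean_runtime_ms and the untimed warm-up call are assumptions, and for GPU solvers the timed callable would have to block until the device has finished.

```python
import statistics
import time


def mean_runtime_ms(fn, runs=3, warmup=1):
    """Return the mean wall-clock time of fn() in milliseconds over `runs` calls.

    `warmup` untimed calls come first (an assumption; the source does not say
    whether warm-up runs were used).
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()  # must block until the work is done (e.g. after a device sync)
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(samples)


if __name__ == "__main__":
    # Stand-in workload; a real measurement would call the solver here.
    print(f"{mean_runtime_ms(lambda: sum(range(100_000))):.3f} ms")
```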

Part 1: DevCGSolver — Poisson (symmetric SPD)

Problem: unit square, H1 order 2, Jacobi smoother preconditioner (CreateSmoother) — matches tutorial 5.5.1
Baselines: CPU CGSolver, C++ CGSolver with device matrices, DevCGSolver no-graph, DevCGSolver Conditional While Graph
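
For orientation, the iteration that all four variants time is conjugate gradients with a Jacobi (diagonal) preconditioner. The sketch below is a plain-Python illustration of that algorithm only. It is not NGSolve code, and it uses a 1D finite-difference Laplacian as a stand-in for the order-2 H1 discretization on the unit square. All function names are hypothetical.

```python
import math


def poisson_1d_matvec(x):
    """Apply the 1D finite-difference Laplacian tridiag(-1, 2, -1) to x."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def jacobi_pcg(matvec, diag, b, tol=1e-10, maxiter=1000):
    """Preconditioned CG for an SPD system, with M^{-1} = diag(A)^{-1}.

    Returns (x, iterations). Stops when sqrt(r.z) has dropped by `tol`
    relative to its initial value.
    """
    x = [0.0] * len(b)
    r = list(b)                                   # r = b - A x for x = 0
    z = [ri / di for ri, di in zip(r, diag)]      # apply Jacobi preconditioner
    p = list(z)
    rz = dot(r, z)
    rz0 = rz
    if rz0 == 0.0:
        return x, 0                               # zero right-hand side
    for it in range(1, maxiter + 1):
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        if math.sqrt(rz_new) <= tol * math.sqrt(rz0):
            return x, it
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, maxiter


if __name__ == "__main__":
    n = 50
    x, its = jacobi_pcg(poisson_1d_matvec, [2.0] * n, [1.0] * n)
    print(f"converged in {its} iterations, x[0] = {x[0]:.6f}")
```

The convergence test inside the loop is the data-dependent branch. It makes the loop a natural candidate for a conditional while construct on the device.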

ndof	CPU all-T (ms)	CPU 1T (ms)	C++ dev (ms)	No-graph (ms)	Conditional While Graph (ms)	Graph speedup vs CPU 1T	Graph speedup vs C++ dev	Graph speedup vs No-graph
1,961	98.5	1.9	6.8	5.7	3.5	0.5×	1.95×	1.64×
5,277	346.6	8.5	10.8	9.1	5.0	1.7×	2.14×	1.81×
11,825	308.5	29.0	16.0	13.7	7.5	3.9×	2.12×	1.81×
46,741	1,351.5	235.9	35.2	29.2	16.8	14.0×	2.09×	1.74×
95,225	2,467.0	695.0	51.6	44.6	27.1	25.7×	1.91×	1.65×
185,809	4,607.0	1,973.7	81.4	74.9	49.5	39.9×	1.64×	1.51×
514,637	28,133.6	9,783.5	198.0	177.9	134.6	72.7×	1.47×	1.32×
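
The last three columns are ratios of the respective baseline time to the Conditional While Graph time. A short sketch that recomputes them from the rounded timings in the table; the names TIMINGS_MS and graph_speedups are illustrative. Because the table was presumably computed from unrounded timings, recomputed values can differ from the printed ones in the last digit.

```python
# Timings in ms copied from the table above:
# ndof -> (CPU 1T, C++ dev, No-graph, Conditional While Graph)
TIMINGS_MS = {
    1_961: (1.9, 6.8, 5.7, 3.5),
    5_277: (8.5, 10.8, 9.1, 5.0),
    11_825: (29.0, 16.0, 13.7, 7.5),
    46_741: (235.9, 35.2, 29.2, 16.8),
    95_225: (695.0, 51.6, 44.6, 27.1),
    185_809: (1973.7, 81.4, 74.9, 49.5),
    514_637: (9783.5, 198.0, 177.9, 134.6),
}


def graph_speedups(cpu_1t, cpp_dev, no_graph, graph):
    """Speedup of the graph variant over each baseline (baseline / graph)."""
    return cpu_1t / graph, cpp_dev / graph, no_graph / graph


if __name__ == "__main__":
    for ndof, row in TIMINGS_MS.items():
        vs_1t, vs_cpp, vs_nograph = graph_speedups(*row)
        print(f"{ndof:>7,d}  {vs_1t:5.1f}x  {vs_cpp:5.2f}x  {vs_nograph:5.2f}x")
```

For the largest problem (514,637 dofs) this reproduces the tabulated 72.7×, 1.47× and 1.32×.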

CPU all-T: NGSolve TaskManager with all 384 logical threads on the node (dual-socket AMD EPYC Zen4). This run does not honor the SLURM setting --cpus-per-task=22.

CPU 1T: no TaskManager, single-threaded reference. It gives an unambiguous lower bound for CPU performance.
The GPU timings (C++ dev, No-graph, Conditional While Graph) are stable and node-independent.
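
Since the CPU all-T run uses all 384 threads regardless of the SLURM allocation, a CPU baseline that respects the allocation would need the thread count to be set explicitly. The helper below is a hypothetical sketch, not part of the benchmark. It reads SLURM_CPUS_PER_TASK, which SLURM exports when --cpus-per-task is given.

```python
import os


def allowed_threads(env=None, fallback=None):
    """Thread count from SLURM_CPUS_PER_TASK, else `fallback`, else os.cpu_count().

    `env` defaults to os.environ; passing a dict makes the function testable.
    """
    env = os.environ if env is None else env
    value = env.get("SLURM_CPUS_PER_TASK", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    if fallback is not None:
        return fallback
    return os.cpu_count() or 1


if __name__ == "__main__":
    # Assumption: the result would be handed to NGSolve (e.g. via its
    # SetNumThreads function) before entering the TaskManager.
    print(allowed_threads())
```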

Part 2: DevTFQMRSolver — 3D convection (non-symmetric)

Problem: unit cube, DG L2 order 2, convection-diffusion, block smoother preconditioner
Comparison: Python TFQMR with device matrices, DevTFQMRSolver no-graph, DevTFQMRSolver Conditional While Graph

ndof	Python TFQMR (ms)	No-graph (ms)	Conditional While Graph (ms)	Graph speedup vs Python TFQMR	Graph speedup vs No-graph
2,060	3.6	1.8	1.5	2.4×	1.26×
6,520	4.6	2.4	1.7	2.7×	1.39×
20,250	6.3	3.1	2.3	2.7×	1.38×
47,810	8.5	4.7	3.2	2.7×	1.50×
149,650	15.5	9.9	8.0	1.9×	1.23×
340,370	29.1	21.3	19.1	1.5×	1.12×
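
To see how the graph benefit changes with problem size, the No-graph and Conditional While Graph columns can be compared in absolute and relative terms. The sketch below uses only the rounded timings from the table; ROWS and graph_saving are illustrative names.

```python
# (ndof, No-graph ms, Conditional While Graph ms) copied from the table above
ROWS = [
    (2_060, 1.8, 1.5),
    (6_520, 2.4, 1.7),
    (20_250, 3.1, 2.3),
    (47_810, 4.7, 3.2),
    (149_650, 9.9, 8.0),
    (340_370, 21.3, 19.1),
]


def graph_saving(no_graph_ms, graph_ms):
    """Return (time saved in ms, saved time as a fraction of the no-graph time)."""
    saved = no_graph_ms - graph_ms
    return saved, saved / no_graph_ms


if __name__ == "__main__":
    for ndof, no_graph, graph in ROWS:
        saved, fraction = graph_saving(no_graph, graph)
        print(f"{ndof:>7,d}  saved {saved:4.1f} ms  ({fraction:5.1%} of no-graph)")
```

With these numbers the absolute saving grows from 0.3 ms to 2.2 ms across the table, while the relative saving is largest at 47,810 dofs (about 32%) and falls to about 10% for the largest problem.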
