USQCD Machine Performance



Machine | Processor | Nodes | Cores | DWF per node (MFlops) | Clover per node (MFlops) | asqtad per node (MFlops) | Jpsi equivalence (Jpsi-core-hours)
--- | --- | --- | --- | --- | --- | --- | ---
jpsi | 2.1 GHz Dual CPU Quad Core Opteron | 856 | 6848 | 10061 | 7423 | 9563 | 1
ds | 2.0 GHz Quad CPU Eight Core Opteron | 421 | 13472 | 51520 | 42048 | 50547 | 1.33
bc | 2.8 GHz Quad CPU Eight Core Opteron | 224 | 7168 | 57408 | 46048 | 56224 | 1.48
pi0 | 2.6 GHz Dual CPU Eight Core Xeon (Ivy Bridge) | 314 | 3424 | 69040 | 47168 | 53436 | 3.14
9q | 2.4 GHz Dual CPU Quad Core Nehalem | 320 | 2560 | 19928 | 15056 | 18128 | 1.96
10q | 2.53 GHz Dual CPU Quad Core Nehalem | 224 | 1792 | 20408 | 15656 | 18046 | 2.00
12s | 2.0 GHz Dual CPU Eight Core Sandy Bridge | 276 | 4416 | 56500 | 32040 | 43740 | 2.44
16p | Intel(R) Xeon Phi(TM) CPU 7230 @ 1.30GHz (Knight’s Landing) | 264 | 16,896 | - | - | | ??
BlueGene/Q | - | - | - | - | - | - | 1.64
Cray XT5 | - | - | - | - | - | - | 1.0

COMMENTS:

The table above shows the measured performance of the DWF, anisotropic clover, and asqtad inverters on the jpsi, Ds, Bc, pi0, 9q, and 10q clusters, on the ANL BG/P, and on the ORNL XT5. All performance numbers are single precision unless otherwise noted. The jpsi cluster is no longer available; its data are included for reference.

The DWF, Clover, and asqtad performance figures for jpsi, Ds, Bc, pi0, 9q, and 10q come from 128-process runs (16-node, 4-node, 4-node, 8-node, 16-node, and 16-node, respectively), with 8, 16, or 32 processes per node and one process per core. DWF and Clover data were taken with Chroma. Clover runs used 6³×64 local (per core) lattices, and DWF runs used 14×7×7×16 local (per core) lattices with Ls = 16. The asqtad runs used 14⁴ local (per core) lattices. Clover and DWF performance measurements used the CG_INVERTER in Chroma.
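
As a quick consistency check on the run geometry described above, the following sketch multiplies the node count of each run by the number of processes per node. The processes-per-node values are not listed per machine in the text; they are inferred here from the processor descriptions in the table (CPUs per node times cores per CPU), so treat them as an assumption.

```python
# Node counts for the 128-process runs, in the order given in the text.
nodes_per_run = {"jpsi": 16, "ds": 4, "bc": 4, "pi0": 8, "9q": 16, "10q": 16}

# Processes per node (one per core). Assumption: derived from the processor
# column of the table, e.g. "Quad CPU Eight Core" -> 4 * 8 = 32.
procs_per_node = {"jpsi": 8, "ds": 32, "bc": 32, "pi0": 16, "9q": 8, "10q": 8}

# Total MPI processes in each benchmark run.
total_procs = {m: nodes_per_run[m] * procs_per_node[m] for m in nodes_per_run}

for machine, procs in total_procs.items():
    print(f"{machine}: {nodes_per_run[machine]} nodes x "
          f"{procs_per_node[machine]} procs/node = {procs} processes")
```

Under these assumptions every run comes out at 128 processes, which matches the description.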

The DWF, Clover, and asqtad performance figures for 12s are estimates taken from single-node benchmarks and an assumed scaling factor of 0.9 between single-node (16-rank) and eight-node (128-rank) runs.
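
The relationship between a single-node benchmark and the corresponding 12s estimate can be sketched as below. The single-node inputs are not given on this page, so the code works backwards from the tabulated estimates; the back-computed single-node numbers are illustrative only and are not measured data. The helper name `estimate_multinode` is hypothetical.

```python
SCALING_FACTOR = 0.9  # assumed single-node -> eight-node scaling, from the text

# Tabulated 12s per-node estimates in MFlops (these already include the factor).
estimates_12s = {"DWF": 56500, "Clover": 32040, "asqtad": 43740}

def estimate_multinode(single_node_mflops, scaling=SCALING_FACTOR):
    """Per-node estimate for the 128-rank run from a single-node figure."""
    return single_node_mflops * scaling

# Single-node figures implied by the table (illustrative, not measured).
implied_single_node = {k: v / SCALING_FACTOR for k, v in estimates_12s.items()}

for name, mflops in implied_single_node.items():
    print(f"{name}: implied single-node {mflops:.0f} MFlops -> "
          f"estimate {estimate_multinode(mflops):.0f} MFlops")
```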

The BG/Q Jpsi-equivalence is based on the average of the DWF and HISQ performance figures.

The XT5 clover performance figure is based on anisotropic clover calculations on a 40³×256 global volume, using 4K-node runs.

The final column of the table gives the Jpsi-equivalence for each of the USQCD resources. All except the Cray XT5 use the ratio of the average performance of asqtad and DWF; the XT5 uses the ratio of the average performance of the asqtad (HISQ) and clover inverters.
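
One plausible reading of this definition, for the clusters with measured figures, is sketched below: average the asqtad and DWF per-node performance, divide by the cores per node, and take the ratio to the same quantity on jpsi. The per-core normalization and the cores-per-node values (taken from the processor descriptions) are assumptions, since the text does not spell them out. The results land close to the tabulated values but do not reproduce them exactly, so this is an illustration of the idea rather than the procedure used to produce the published numbers.

```python
# Per-node MFlops from the table; cores per node from the processor column.
machines = {
    "jpsi": {"dwf": 10061, "asqtad": 9563,  "cores_per_node": 8},
    "ds":   {"dwf": 51520, "asqtad": 50547, "cores_per_node": 32},
    "bc":   {"dwf": 57408, "asqtad": 56224, "cores_per_node": 32},
    "pi0":  {"dwf": 69040, "asqtad": 53436, "cores_per_node": 16},
    "9q":   {"dwf": 19928, "asqtad": 18128, "cores_per_node": 8},
    "10q":  {"dwf": 20408, "asqtad": 18046, "cores_per_node": 8},
    "12s":  {"dwf": 56500, "asqtad": 43740, "cores_per_node": 16},
}

def per_core_average(m):
    """Average of DWF and asqtad per-node MFlops, normalized per core."""
    return (m["dwf"] + m["asqtad"]) / 2 / m["cores_per_node"]

def jpsi_equivalence(name):
    """Per-core average performance relative to jpsi (assumed definition)."""
    return per_core_average(machines[name]) / per_core_average(machines["jpsi"])

for name in machines:
    print(f"{name}: {jpsi_equivalence(name):.2f} Jpsi-core-hours per core-hour")
```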