USQCD Machine Performance



| Machine | Processor per node | Total no. of nodes | Total no. of cores | DWF per node (MFlops) | Clover per node (MFlops) | asqtad per node (MFlops) | Jpsi equivalence (Jpsi-core-hour) |
|---|---|---|---|---|---|---|---|
| kaon | 2.0 GHz Dual CPU Dual Core Opteron | 600 | 2400 | 4696 | 3180 | 3832 | 0.88 |
| jpsi | 2.1 GHz Dual CPU Quad Core Opteron | 856 | 6848 | 10061 | 7423 | 9563 | 1 |
| 7n | 1.9 GHz Dual CPU Quad Core Opteron | 396 | 3168 | 8800 | 5148 | 6300 | 0.77 |
| 9q | 2.4 GHz Dual CPU Quad Core Nehalem | 320 | 2560 | | | | 1.94 |
| 9g | GPU | 200 | - | | | | TBD |
| QCDOC | 400 MHz PPC Core (estimated) | 12288 | 12288 | 336 | - | 360 | 0.24 |
| BlueGene/P | 850 MHz Quad Core PowerPC 850 | 1024 per rack | 4096 per rack | 2560 | 2511 | 2680 | 0.54 |
| Cray XT5 | 2.1 GHz Quad Core Opteron | 7832 | 31328 | - | 2232 | 2260 | 0.50 |
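
As a quick consistency check on the table, the total core count of each machine should equal its node count times the number of cores per node. The sketch below is a minimal Python illustration. The cores-per-node values are an assumption read off the "Processor per node" column (for example, "Dual CPU Quad Core" is taken as 2 x 4 = 8), and the dictionary layout and function name are hypothetical.

```python
# (total nodes, cores per node, total cores) for each machine in the table.
# Cores per node is inferred from the processor description; this is an
# assumption of the sketch, not a column of the original table.
machines = {
    "kaon": (600, 4, 2400),
    "jpsi": (856, 8, 6848),
    "7n": (396, 8, 3168),
    "9q": (320, 8, 2560),
    "QCDOC": (12288, 1, 12288),
    "BlueGene/P (per rack)": (1024, 4, 4096),
    "Cray XT5": (7832, 4, 31328),
}


def core_count_mismatches(table):
    """Return the machines whose listed core total differs from nodes * cores per node."""
    return [
        name
        for name, (nodes, cores_per_node, total_cores) in table.items()
        if nodes * cores_per_node != total_cores
    ]


mismatches = core_count_mismatches(machines)
print(mismatches or "all listed core totals are consistent")
```

The 9g GPU cluster is left out because the table gives no core count for it.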

WARNING:

Performance numbers for the 9q cluster are preliminary estimates and will be updated soon. The conversion factors for the 9g GPU strongly depend on the usage model and will be determined at a later stage.

COMMENTS:

The table above shows the measured performance of DWF, anisotropic clover, and asqtad inverters on the kaon, jpsi, 7n and 9q clusters, on the ANL BG/P, the ORNL XT5 and the QCDOC. All performance numbers are single precision unless otherwise noted.

The DWF, Clover and asqtad performance figures for kaon, jpsi and 7n used 128-process (32-node, 16-node, 64-node, and 16-node, respectively) runs, with 4, 2, or 8 processes per node, one process per core. DWF and Clover data were taken with Chroma. kaon and jpsi Clover runs used 6^3x64 local (per core) lattices, and DWF runs used 14x7x7x16 local (per core) lattices with Ls=16. Clover performance on 7n used 4^3x8 local volumes per process, DWF performance used a global volume of 24^3x64 (Ls=16), and asqtad performance used a local volume of 14^4.
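
The relation between process count, processes per node, and node count in the runs described above can be sketched in a few lines of Python, together with the number of lattice sites in the quoted local volumes. This is only an illustration: the function names are hypothetical, and counting DWF sites as the four-dimensional volume times Ls is an assumption of the sketch.

```python
from math import prod


def nodes_needed(processes, processes_per_node):
    """Number of nodes for a run with one process per core and full nodes."""
    nodes, remainder = divmod(processes, processes_per_node)
    if remainder:
        raise ValueError("process count must fill whole nodes")
    return nodes


def local_sites(dims, ls=1):
    """Sites in a local (per core) lattice; ls is the DWF fifth-dimension extent."""
    return prod(dims) * ls


# 128-process runs with 4, 2, or 8 processes per node.
node_counts = {ppn: nodes_needed(128, ppn) for ppn in (4, 2, 8)}

# Local volumes quoted for the kaon and jpsi runs.
clover_sites = local_sites((6, 6, 6, 64))        # 6^3x64
dwf_sites = local_sites((14, 7, 7, 16), ls=16)   # 14x7x7x16 with Ls=16

print(node_counts)
print(clover_sites, dwf_sites)
```

With 128 processes, 4 processes per node corresponds to 32 nodes, 2 to 64 nodes, and 8 to 16 nodes, which matches the node counts quoted in the text.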

The QCDOC DWF (double precision) and asqtad (single precision) estimates are based on the observed peak performance of the double precision conjugate gradient codes on early motherboards, scaled to 400 MHz. Clover performance data are not available.

The BG/P asqtad result is the average of the performance of 6^4 and 8^4 local volumes, and is single precision. The DWF result is double precision, using 4^4 (Ls=16) local volumes. The Clover result used 4096 cores.

The XT5 Clover performance figure is based on anisotropic Clover calculations on a 32^3x256 global volume run on 24 cores (Robert Edwards), and the asqtad figure is based on HISQ runs on 64^3x128 lattices on 2k cores (Steve Gottlieb).

The final column of the table gives the Jpsi-equivalence for each of the USQCD resources. All except the Cray XT5 use the ratio of the average performance of asqtad and DWF; the XT5 uses the ratio of the average performance of the asqtad (HISQ) and clover inverters. Also, the QCDOC Jpsi-equivalence figure of 0.24 has been assigned to be consistent with prior years' accounting, rather than using the estimated DWF and asqtad performance values.
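
One possible reading of the Jpsi-equivalence calculation is sketched below in Python. It assumes that the per-node figures are divided by the number of cores per node, that the two inverter results are averaged with equal weight, and that the result is divided by the same quantity for jpsi. None of these normalization details are stated explicitly on this page, and the names `PERF`, `per_core_average`, and `jpsi_equivalence` are hypothetical.

```python
# Per-node inverter performance in MFlops, copied from the table.
# "cores" is cores per node (total cores divided by total nodes).
PERF = {
    "jpsi": {"cores": 8, "dwf": 10061, "clover": 7423, "asqtad": 9563},
    "kaon": {"cores": 4, "dwf": 4696, "clover": 3180, "asqtad": 3832},
    "7n": {"cores": 8, "dwf": 8800, "clover": 5148, "asqtad": 6300},
    "BlueGene/P": {"cores": 4, "dwf": 2560, "clover": 2511, "asqtad": 2680},
    "Cray XT5": {"cores": 4, "clover": 2232, "asqtad": 2260},
}


def per_core_average(machine, inverters):
    """Equal-weight average of the chosen inverters, per core, in MFlops."""
    entry = PERF[machine]
    per_node = sum(entry[name] for name in inverters) / len(inverters)
    return per_node / entry["cores"]


def jpsi_equivalence(machine, inverters=("asqtad", "dwf")):
    """Ratio of a machine's per-core average to the same average on jpsi (assumed)."""
    return per_core_average(machine, inverters) / per_core_average("jpsi", inverters)


for name in ("kaon", "7n", "BlueGene/P"):
    print(f"{name}: {jpsi_equivalence(name):.2f}")

# The XT5 has no DWF figure, so asqtad (HISQ) and clover are averaged instead.
print(f"Cray XT5: {jpsi_equivalence('Cray XT5', ('asqtad', 'clover')):.2f}")
```

Under these assumptions the 7n value rounds to the published 0.77, while the other machines come out within a few hundredths of the published figures, so the exact normalization used for the table may differ slightly from this sketch. The QCDOC is omitted because its figure of 0.24 was assigned rather than computed.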