Intel Paragon

Timings in seconds for MbCO + 3830 water molecules (14026 atoms), 1000 steps, 12-14 A shift, on the Intel Paragon.

Nodes                 1        2        4        8       16       32       64      128

E ext           51310.6  25655.3  12797.5   6408.5   3267.4   1666.4    851.6    438.5
E int             396.1    212.3    133.7     92.1     67.9     55.9     49.5     46.3
Wait                0.0    151.5     84.1     51.1     36.5     27.4     35.0     20.2
Comm                0.0     33.0     52.3     65.4     76.5     84.9     98.3    104.8
List             3275.9   1695.7   1010.1    504.5    255.2    133.2     72.0     42.4
Integ             126.2     75.0     38.9     21.2     12.7      8.8      7.2      6.8
Total           55108.8  27822.8  14116.6   7142.8   3716.2   1976.7   1113.6    659.0

Total(hours)      15.31     7.73     3.92     1.98     1.03     0.55     0.31     0.18
Eff              100.0%    99.0%    97.6%    96.4%    92.6%    87.1%    77.3%    65.3%
Speedup             1.0     1.98      3.9      7.7       15       28       49       84

E ext  : External energy terms (electrostatics + Lennard-Jones)
E int  : Internal energy terms (bond, angle, dihedral)
Wait   : Load imbalance
Comm   : Communication time (Vector Distr. Global {Sum,Brdcst})
List   : Nonbond list generation time
Integ  : Time needed to integrate equations of motion
Total  : Total elapsed time
Eff    : Efficiency = speedup divided by number of nodes
Speedup: Time for one node divided by time for N nodes
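
The Eff and Speedup rows can be rechecked from the Total row. The following is a minimal Python sketch, not part of the original benchmark: the helper names speedup and efficiency are made up for illustration, and speedup is taken as the one-node time divided by the N-node time, which is the convention the tabulated Speedup values follow. The printed values should agree with the table approximately, up to rounding in the last digit.

```python
# Total elapsed time in seconds, copied from the "Total" row of the table.
nodes = [1, 2, 4, 8, 16, 32, 64, 128]
total_seconds = [55108.8, 27822.8, 14116.6, 7142.8, 3716.2, 1976.7, 1113.6, 659.0]


def speedup(t_one, t_n):
    """Speedup: one-node time divided by N-node time (illustrative helper)."""
    return t_one / t_n


def efficiency(t_one, t_n, n):
    """Efficiency: speedup divided by the number of nodes (illustrative helper)."""
    return speedup(t_one, t_n) / n


t_one = total_seconds[0]
rows = []
for n, t in zip(nodes, total_seconds):
    hours = t / 3600.0  # seconds to hours, as in the Total(hours) row
    s = speedup(t_one, t)
    e = efficiency(t_one, t, n)
    rows.append((n, hours, s, e))
    print(f"{n:5d}  {hours:6.2f} h  speedup {s:6.2f}  eff {100 * e:5.1f}%")
```

Since efficiency is speedup divided by the node count, any time not spent in perfectly divided work (the Wait and Comm rows, plus work that does not shrink with more nodes) lowers it as the node count grows.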

Comments

See Intel WWW pages for more details

Timothy G. Matson (tgm@ssd.intel.com) writes:

However, I have also been running the program on our existing 
supercomputers.  There are a few simple optimizations I 
had to carry out.  The most important was to replace your
own global ops with the most recent van de Geijn library 
called iCC.  There was a race condition (which I documented
and passed on to Bernie) in the default routines.  Also, the
iCC library runs on any number of nodes so I don't have to
worry about the power of 2 stuff.

The other optimization was to limit the number of communication 
buffers managed by each Paragon node by setting the runtime
flag -loc to log(P)+1 where P is the number of processors.
This lets the message passing take place much more efficiently.

If I do that, I get the following numbers for the Paragon 
using R1.2 with the message co-processor turned on:

                      Nodes     Time
        Paragon XP        1    14.37
                          8     1.92
                         16     0.98
                         32     0.52
                         64     0.29
                        128     0.18
                        256     0.14
                        512     0.098

This is pretty cool!!! As far as I know, the 0.098 number
is the fastest time by anyone for this benchmark.