### Timing in seconds for MbCO + 3830 water molecules (14026 atoms), 1000 steps, 12-14 Å shift, on Intel Paragon

Nodes                  1         2         4         8        16        32        64       128

E ext            51310.6   25655.3   12797.5    6408.5    3267.4    1666.4     851.6     438.5
E int              396.1     212.3     133.7      92.1      67.9      55.9      49.5      46.3
Wait                 0.0     151.5      84.1      51.1      36.5      27.4      35.0      20.2
Comm                 0.0      33.0      52.3      65.4      76.5      84.9      98.3     104.8
List              3275.9    1695.7    1010.1     504.5     255.2     133.2      72.0      42.4
Integ              126.2      75.0      38.9      21.2      12.7       8.8       7.2       6.8
Total            55108.8   27822.8   14116.6    7142.8    3716.2    1976.7    1113.6     659.0

Total(hours)       15.31      7.73      3.92      1.98      1.03      0.55      0.31      0.18
Eff               100.0%     99.0%     97.6%     96.4%     92.6%     87.1%     77.3%     65.3%
Speedup              1.0      1.98       3.9       7.7        15        28        49        84

E ext :  External energy terms (electrostatics + Lennard-Jones)
E int :  Internal energy terms (bond, angle, dihedral)
Comm  :  Communication time (Vector Distr. Global {Sum,Brdcst})
List  :  Nonbond list generation time
Integ :  Time needed to integrate equations of motion
Total :  Total elapsed time
Eff   :  Efficiency = speedup divided by number of nodes
Speedup: Time for one node divided by time for N nodes.
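
The Eff and Speedup rows can be recomputed from the Total row. The following is a minimal sketch, taking speedup as the one-node time over the N-node time (the reading consistent with the Speedup row) and efficiency as speedup over node count. The names `total_seconds`, `speedup`, and `efficiency` are illustrative and not part of the benchmark code.

```python
# Total elapsed time in seconds per node count, copied from the "Total" row.
total_seconds = {
    1: 55108.8, 2: 27822.8, 4: 14116.6, 8: 7142.8,
    16: 3716.2, 32: 1976.7, 64: 1113.6, 128: 659.0,
}


def speedup(times, nodes):
    """Single-node time divided by the time measured on `nodes` nodes."""
    return times[1] / times[nodes]


def efficiency(times, nodes):
    """Speedup divided by the number of nodes, as a fraction of 1."""
    return speedup(times, nodes) / nodes


for n in sorted(total_seconds):
    print(f"{n:4d} nodes: speedup {speedup(total_seconds, n):6.2f}, "
          f"efficiency {100 * efficiency(total_seconds, n):5.1f}%")
```

Small differences from the printed rows (for example at 16 nodes) come down to rounding in the original table.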


See Intel WWW pages for more details

Timothy G. Matson (tgm@ssd.intel.com) writes:

However, I have also been running the program on our existing
supercomputers.  There are a few simple optimizations I
had to carry out.  The most important was to replace your
own global ops with the most recent van de Geijn library
called iCC.  There was a race condition (which I documented
and passed on to Bernie) in the default routines.  Also, the
iCC library runs on any number of nodes so I don't have to
worry about the power of 2 stuff.

The other optimization was to limit the number of communication
buffers managed by each Paragon node by setting the runtime
flag -loc to log(P)+1 where P is the number of processors.
This lets the message passing take place much more efficiently.
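
As a rough illustration of the log(P)+1 rule, the sketch below tabulates the value for the node counts used in these runs. It assumes a base-2 logarithm and rounds up for processor counts that are not powers of two; the message above states neither, and the helper name `loc_value` is hypothetical. How the value is passed on the command line is not shown here.

```python
import math


def loc_value(num_processors):
    """Return log2(P) + 1 as a whole number.

    Assumptions (not stated in the source): the logarithm is base 2,
    and a fractional logarithm is rounded up.
    """
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    return math.ceil(math.log2(num_processors)) + 1


for p in (8, 16, 32, 64, 128, 256, 512):
    print(f"P = {p:3d}: log(P)+1 = {loc_value(p)}")
```

Under these assumptions the value grows slowly with machine size: it is 4 for 8 processors and 10 for 512.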

If I do that, I get the following numbers for the Paragon
using R1.2 with the message co-processor turned on:

Machine       Nodes   Time (hours)
Paragon XP        1      14.37
                  8       1.92
                 16       0.98
                 32       0.52
                 64       0.29
                128       0.18
                256       0.14
                512       0.098

This is pretty cool!!! As far as I know, the 0.098 number
is the fastest time by anyone for this benchmark.
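
To put the Paragon XP times in terms of scaling, the sketch below derives speedup and efficiency relative to the one-node time. The ratios do not depend on the time unit. The names `paragon_xp` and `scaling` are illustrative only.

```python
# Node count -> reported time, copied from the Paragon XP table above.
paragon_xp = {1: 14.37, 8: 1.92, 16: 0.98, 32: 0.52,
              64: 0.29, 128: 0.18, 256: 0.14, 512: 0.098}


def scaling(times):
    """Return (nodes, speedup, efficiency) tuples relative to one node."""
    base = times[1]
    return [(n, base / t, base / t / n) for n, t in sorted(times.items())]


for nodes, s, e in scaling(paragon_xp):
    print(f"{nodes:4d} nodes: speedup {s:6.1f}, efficiency {100 * e:5.1f}%")
```

By this arithmetic the 512-node run is about 147 times faster than the one-node run, which is roughly 29% efficiency.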