	Using MOLDY as a Benchmark
	__________________________

Moldy has been tested on a variety of machines and provides a
reasonable example of a scientific floating point program which is not
a heavy user of I/O or memory.  However, it does access memory quite
rapidly in the inner loops, in some of them at a rate of one memory
operation per floating point operation.  It is well suited to vector
machines but, for several reasons, does not usually approach the
"peak" figure.

There are two major parts to the computation.  1) The "Ewald sum"
consists mainly of linear passes through arrays doing multiplies and
adds, with a few sum reductions.  This is expensive on scalar machines
but very fast on vector machines with good memory bandwidth.  One
function, qsincos(), was measured by the PERFTRACE utility on the Cray
XMP to be achieving 230 Mflop/s -- exactly the peak!  2) The "real
space force calculation" has some scalar overhead and a vectorisable
part which makes heavy use of gather/scatter operations and math
library functions, especially sqrt() and exp().  These should all make
use of vector hardware, but the combination reduces the Mflop/s rate
substantially.

It is thus reasonably demanding of both compilers and hardware and
tests mainly CPU rate and memory bandwidth.

To use:  compile and build the program (see READ.ME) and execute with
the test case control file "control.water", directing the output to
some other file.

% moldy control.water > bench.out

This run executes 10 timesteps of the standard benchmark test case,
and prints timing information at the bottom of the file.  The output
should be numerically identical to the sample in "water-example.out",
which was run on a SparcStation II.  (Floating point roundoff should
*not* make any significant difference.)

For a reasonable supercomputer this run is probably too short, so try
the run in "control.100" (which performs 100 steps) instead, and divide
the times by 10 to obtain figures comparable with the 10-step run.

	Rules of the Benchmark
	______________________

The idea is to allow minor machine-specific optimizations, such as
adding compiler directives or replacing the BLAS-like calls in "aux.c"
with calls to assembler libraries, but not to change the substance
of the algorithm.  Minor rewriting of loops to enable efficient
vectorization is allowed.  The results must, of course, be correct.
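
As an illustration of the kind of substitution meant here, the sketch
below shows a BLAS-like strided dot product in portable C with an
optional switch to a library routine.  The function name dot_strided()
and the macro USE_FORTRAN_BLAS are hypothetical and are not taken from
"aux.c".  The library call shown is the reference BLAS routine DDOT;
the trailing-underscore name and pass-by-reference arguments are the
usual Fortran linkage convention, which is an assumption about the
target platform.  Only positive strides are handled.

```c
#ifdef USE_FORTRAN_BLAS   /* hypothetical build macro for this sketch */
/* Reference BLAS DDOT; the symbol name "ddot_" assumes the common
 * trailing-underscore Fortran linkage convention. */
extern double ddot_(const int *n, const double *x, const int *incx,
                    const double *y, const int *incy);
#endif

/* Hypothetical BLAS-like routine: dot product of two strided vectors.
 * Strides incx and incy are assumed to be positive. */
double dot_strided(int n, const double *x, int incx,
                   const double *y, int incy)
{
#ifdef USE_FORTRAN_BLAS
    /* Machine-specific path: hand the loop to the tuned library. */
    return ddot_(&n, x, &incx, y, &incy);
#else
    /* Portable path: plain C loop, left to the compiler to vectorise. */
    double sum = 0.0;
    int i;

    for (i = 0; i < n; i++)
        sum += x[i * incx] * y[i * incy];
    return sum;
#endif
}
```

Both paths compute the same quantity, so a change of this sort alters
only how the loop is executed and leaves the algorithm untouched.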

	Parallel Version
	________________

This is still experimental and not so portable.  It is designed for
compilers with the ability to parallelise loops.  Replace "ewald.c"
and "force.c" with their "x_parallel.c" variants.  The loops over
the calls to functions "ewald_inner()" and "force_inner()" are the
ones which must be parallelized, with one iteration executed per
processor.  The program reads the environment variable "THREADS" to
determine the number of processors to use.
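
The structure described above might look roughly like the following
sketch.  It is not the code in the parallel source files: work_inner()
is a hypothetical stand-in for "ewald_inner()" or "force_inner()",
whose real argument lists are not shown here, and MAX_THREADS and the
fallback to one processor when "THREADS" is unset are assumptions made
for the example.  The point is only that each iteration of the outer
loop handles its own slice of the work and writes to its own result
slot, so the iterations are independent.

```c
#include <stdlib.h>

#define MAX_THREADS 64   /* hypothetical upper bound for this sketch */

/* Read the processor count from the THREADS environment variable.
 * Falling back to 1 when it is unset or invalid is an assumption. */
int thread_count(void)
{
    const char *s = getenv("THREADS");
    int n = (s != NULL) ? atoi(s) : 1;

    if (n < 1)
        n = 1;
    if (n > MAX_THREADS)
        n = MAX_THREADS;
    return n;
}

/* Hypothetical stand-in for ewald_inner()/force_inner(): processes
 * slice number ithread of nthreads and stores into its own slot. */
static void work_inner(int ithread, int nthreads, int n,
                       const double *x, double *partial)
{
    int lo = (int)((long)n * ithread / nthreads);
    int hi = (int)((long)n * (ithread + 1) / nthreads);
    double sum = 0.0;
    int i;

    for (i = lo; i < hi; i++)
        sum += x[i];
    partial[ithread] = sum;
}

/* Driver: the loop over calls to work_inner() is the one a
 * parallelising compiler would spread over processors. */
double work(int nthreads, int n, const double *x)
{
    double partial[MAX_THREADS];
    double total = 0.0;
    int ithread;

    if (nthreads < 1)
        nthreads = 1;
    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;

    for (ithread = 0; ithread < nthreads; ithread++)  /* parallel loop */
        work_inner(ithread, nthreads, n, x, partial);

    for (ithread = 0; ithread < nthreads; ithread++)  /* serial combine */
        total += partial[ithread];
    return total;
}
```

A caller would typically write work(thread_count(), n, x).  Compiled
serially, the loop simply runs the slices one after another, which
makes it easy to check that the result is the same before enabling
any parallelising option.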

This version has been tested on the Stardent 1000, 2000 and 3000
(Titan) series machines and the Convex C2 (though it has not been
benchmarked on the latter).  For the Stardent Titan the macro
PARALLEL must be defined for the compilation of main.c and alloc.c to
set up correctly for a parallel run.  Good luck.

	Keith Refson, August 1991

