[molpro-user] values for --tuning-mpplat and --tuning-mppspeed
werner at theochem.uni-stuttgart.de
Thu Dec 29 10:34:09 CET 2016
These parameters only affect the blocking size in the parallel matrix multiplication (mxma_mpp). The optimal blocking also depends on the CPU speed and the matrix dimensions (basis size), and therefore it is hard to make general predictions. The best is simply to try! If you use a small latency and a large bandwidth value, the blocks may get smaller and more cores may be used in the matrix multiplications. Overall, the effect is rather small, as you noticed.
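The trade-off can be sketched with a toy cost model. This is NOT Molpro's actual blocking heuristic; the per-core speed (20 GFLOP/s) and the 10% overhead target are made-up illustrative numbers, and only the direction of the effect matters:

```python
# Toy cost model -- not Molpro's actual algorithm.  It illustrates the
# point above: with cheap communication (low latency, high bandwidth),
# smaller blocks remain efficient, so the work can be spread over more
# cores.  core_gflops and max_overhead are illustrative assumptions.

def min_block(latency_us, bandwidth_mb_s, n=5000,
              core_gflops=20.0, max_overhead=0.10):
    """Smallest square block size (tried in steps of 10) for which the
    cost of sending one block (latency + transfer of block**2 doubles)
    stays below max_overhead of that block's compute time."""
    for block in range(10, n, 10):
        t_comp = 2.0 * block * block * n / (core_gflops * 1e9)   # seconds
        t_comm = latency_us * 1e-6 + 8.0 * block * block / (bandwidth_mb_s * 1e6)
        if t_comm <= max_overhead * t_comp:
            return block
    return n

# A fast interconnect tolerates smaller blocks than a slow one:
print(min_block(1.0, 7000.0))    # near-hardware latency/bandwidth
print(min_block(16.0, 620.0))    # values measured on our cluster
```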
On our cluster with an Infiniband QDR network and Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors, mpptune gives the following values:
10 cores, 1 node:  LATENCY=  5 MICROSEC, SPEED= 2390 MB/SEC, MXMA_MPP 136 GFLOP/SEC (max)
20 cores, 1 node:  LATENCY=  8 MICROSEC, SPEED= 1606 MB/SEC, MXMA_MPP 159 GFLOP/SEC
40 cores, 4 nodes: LATENCY= 16 MICROSEC, SPEED=  760 MB/SEC, MXMA_MPP 146 GFLOP/SEC
80 cores, 4 nodes: LATENCY= 16 MICROSEC, SPEED=  639 MB/SEC, MXMA_MPP 128 GFLOP/SEC
This is for matrix dimension 960, which is clearly too small to get the maximum speed.
For example, with n=5000 I get
80 cores 4 nodes: LATENCY= 16 MICROSEC, SPEED= 620 MB/SEC, MXMA_MPP 532 GFLOP/SEC
and with n=10000:
40 cores 4 nodes: LATENCY= 16 MICROSEC, SPEED= 743 MB/SEC, MXMA_MPP 637 GFLOP/SEC
80 cores 4 nodes: LATENCY= 16 MICROSEC, SPEED= 620 MB/SEC, MXMA_MPP 847 GFLOP/SEC
Note that mpptune uses the previous values in lib/tuning.rc for the matrix multiplications and only determines the latency and bandwidth at the end.
It should therefore be run twice. The speed is strongly limited by the bandwidth. And of course, your matrices will be much smaller and the parallelization will be much less effective.
If the 10% gain in speed is important, you should optimize the values on your machine for a typical application.
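Once you have settled on values, they are passed on the command line as in your example. A sketch of a hypothetical invocation (the input file name and core count are placeholders, not from a real run):

```shell
# Hypothetical example: run Molpro with tuned values for this cluster
# (input.inp and -n 80 are placeholders)
molpro -n 80 --tuning-mpplat 16 --tuning-mppspeed 620 input.inp
```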
> Am 28.12.2016 um 10:39 schrieb Gershom (Jan) Martin <gershom at weizmann.ac.il>:
> Dear Molpro gurus:
> What would be sensible values for --tuning-mpplat and --tuning-mppspeed between cluster nodes with an Infiniband 4-lane QDR (fully nonblocking) interconnect?
> What about FDR?
> The defaults in the distribution Linux/Intel binary are apparently
> --tuning-mpplat 3 and --tuning-mppspeed 1600
> The old mpptune.com script (no longer included with M2015, presumably obsolete) gives latencies WAY in excess of this, even when run within a single node.
> I tried running an 8-box job (water tetramer CCSD(T)-F12c/cc-pVQZ-F12, no symmetry, 8 processes per box, 2 threads each) with custom parameters
> --tuning-mpplat 1 and --tuning-mppspeed 7000
> (i.e., close to the theoretical hardware values)
> and am seeing about a 10% speedup in wall clock time, but presumably I am living dangerously here, as the true latency and bandwidth reflect software overheads...
> On the same question: what values are optimal for running all inside one box?
> Many thanks in advance!
> Jan Martin
> *** "Computational quantum chemistry and more" ********* rMBP *************************************
> Dr. Gershom (Jan M.L.) Martin | Baroness Thatcher Professor of Chemistry
> Department of Organic Chemistry
> Weizmann Institute of Science | Kimmelman Building, Room 361 | 76100 Rehovot, ISRAEL
> Web: http://compchem.me | Skype: gershom2112
> mailto:gershom at weizmann.ac.il |
> Office: +972-8-9342533 | Fax: +972-8-9343029 | Mobile: +972-50-5109635
Prof. Dr. Hans-Joachim Werner
Institute for Theoretical Chemistry
University of Stuttgart
Tel: +49 711 / 685 64400
Fax: +49 711 / 685 64442
email: werner at theochem.uni-stuttgart.de