[molpro-user] problems running molpro-mpp-2010.1-24.Linux_x86_64 on AMD-based cluster

Anatoliy Volkov Anatoliy.Volkov at mtsu.edu
Wed Jun 20 17:36:38 BST 2012


Dear Andy,

Thank you very much for your prompt reply. I have managed to compile Molpro 2010.1.25
using the following options:
./configure -gcc -gfortran -openmpi -mpp -mppbase /usr/local/openmpi_gcc46/include -blas -blaspath /usr/local/acml5.1.0/gfortran64/lib
I believe the tuning went well (no errors were reported), but when I ran the tests on 4 processors with
make MOLPRO_OPTIONS=-n4 test
I got errors from the following tests: Cs_DKH10, Cs_DKH2, Cs_DKH2_standard,
Cs_DKH3, Cs_DKH4, Cs_DKH7, Cs_DKH8. The error appears to be the same in each case and
comes from MPI:

Running job Cs_DKH10.test
[wizard:11602] *** An error occurred in MPI_Allreduce
[wizard:11602] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[wizard:11602] *** MPI_ERR_TRUNCATE: message truncated
[wizard:11602] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

Running job Cs_DKH2.test
[wizard:11720] *** An error occurred in MPI_Allreduce
[wizard:11720] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[wizard:11720] *** MPI_ERR_TRUNCATE: message truncated
[wizard:11720] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

.....
.....

Running job Cs_DKH8.test
[wizard:12281] *** An error occurred in MPI_Allreduce
[wizard:12281] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[wizard:12281] *** MPI_ERR_TRUNCATE: message truncated
[wizard:12281] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
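
(For reference, a single failing case can also be re-run by hand to see the full output; the
following is only a rough sketch, assuming the test inputs sit under testjobs/ in the build tree
and using the bin/molpro driver:

cd /path/to/molpro/source        # placeholder for the build tree
./bin/molpro -n4 -o /tmp/Cs_DKH10.out testjobs/Cs_DKH10.test
grep -i error /tmp/Cs_DKH10.out

The grep at the end is just a quick way to pull any error messages out of the output file.)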


Interestingly enough, there are no errors for other tests I have run so far:
Running job Cs_nr.test
Running job allene_opt.test
Running job allyl_cipt2.test
Running job allyl_ls.test
Running job ar2_dk_dummy.test
Running job au2o_optdftecp1.test
Running job au2o_optdftecp2.test
Running job au2o_optecp.test
Running job aucs4k2.test
Running job b_cidft.test
Running job basisinput.test
Running job bccd_opt.test
Running job bccd_save.test
Running job benz_nlmo.test
Running job benzol_giao.test
Running job big_lattice.test
Running job br2_f12_multgem.test
Running job c2f4_cosmo.test
Running job c2h2_dfmp2.test
Running job c2h4_c1_freq.test
Running job c2h4_ccsd-f12.test
Running job c2h4_ccsdfreq.test
Running job c2h4_cosmo.test
Running job c2h4_cosmo_direct.test
Running job c2h4_d2.test
Running job c2h4_d2h.test
Running job c2h4_d2h_freq.test
Running job c2h4_ksfreq.test
Running job c2h4_lccsd.test
Running job c2h4_lccsd2.test
Running job c2h4_lccsd3.test
Running job c2h4_lmp2.test
Running job c2h4_optnum.test
Running job c2h4_prop.test
Running job c2h4o_cosmo.test
Running job c6h6_freq.test
Running job c6h6_freq_restart.test
Running job c6h6_opt.test

Do you have any suggestions on how to fix the MPI error? Should I worry about it at all?

Thank you in advance for your help.

Best Regards,
Anatoliy

________________________________________
From: mayaj1 at Cardiff.ac.uk [mayaj1 at Cardiff.ac.uk]
Sent: Wednesday, June 20, 2012 4:21 AM
To: Anatoliy Volkov
Cc: molpro-user at molpro.net
Subject: Re: [molpro-user] problems running molpro-mpp-2010.1-24.Linux_x86_64 on AMD-based cluster

Anatoliy,

Yes, there appears to be some problem when running the binaries on
openSUSE 12.1. It is not simply a binary incompatibility problem; I see
the same issue building from source code on 12.1. Apparently, building
with pure TCGMSG produces an executable which crashes upon the first
call to a Global Arrays routine. We have this bug reported:

https://www.molpro.net/bugzilla/show_bug.cgi?id=3712

and are looking into a fix.

I see that you have access to the source code, so you can easily build a
TCGMSG-MPI or MPI2 version from source, which should work fine. Please
let us know if you have any problems building.
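
As a rough sketch only (the paths are placeholders and the exact options depend on your
compilers and MPI installation), an MPI-based build can be configured along these lines, with
-mppbase pointing at the MPI include directory:

./configure -gcc -gfortran -openmpi -mpp -mppbase /path/to/openmpi/include
make
make MOLPRO_OPTIONS=-n4 test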

Just for information, rsh should be the default for the binaries, but it
can be changed by setting the TCGRSH environment variable (or by passing
the --tcgssh option to the bin/molpro shell script).
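
A minimal example, assuming ssh lives in /usr/bin and using placeholder file names:

export TCGRSH=/usr/bin/ssh
/usr/local/molpro/molpro -n4 -o test.out test.com

or, equivalently, pass --tcgssh on the molpro command line.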

Best wishes,

Andy

On 19/06/12 18:16, Anatoliy Volkov wrote:
> Greetings,
>
> I seem to have hit a wall trying to get molpro-mpp-2010.1-24.Linux_x86_64 to run
> on my AMD-based cluster (16 nodes, 6-core Phenom II X6 1090T or FX-6100 CPUs, 16 GB RAM
> per node, openSUSE 12.1 x86-64, kernel 3.1.10-1.9-desktop), while there are
> absolutely no issues running the same version of Molpro on my old Intel-based cluster
> (dual-socket quad-core Xeon E5230 CPUs, 16 GB RAM per node, openSUSE 11.4 x86-64,
> kernel 2.6.37.6-0.11-desktop).
>
> On the AMD cluster, when Molpro starts to run on the master node, it tries to allocate a lot of memory
> and then dies. I have taken a couple of snapshots of 'top' (see the attached top.log file).
>
> At first it tries to allocate 9 GB, then 19 GB, then 25 GB, etc., and then it dies with the
> following error in the TORQUE log file:
>
> Running Molpro
> tmp = /home/avolkov/pdir//usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe.p
>
>   Creating: host=viz01, user=avolkov,
>             file=/usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe, port=34803
>
>   60: ListenAndAccept: timeout waiting for connection 0 (0).
>
> 0.008u 0.133s 3:01.86 0.0%    0+0k 9344+40io 118pf+0w
>
> I am not sure I understand what is happening here. My cluster uses passwordless rsh, and
> I have not noticed any issues with communication between the nodes. At least my own code, which
> I compile with rsh-enabled OpenMPI, runs just fine on the cluster. Could it be that this version of
> Molpro tries to use ssh? But then I do not understand why it works on my Intel cluster, where only
> rsh is available...
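>
> (A quick way to check which remote shell actually works from the master node would be, e.g.,
>
>   rsh viz02 hostname
>   ssh viz02 hostname
>
> each of which should print the remote node name without prompting for a password if the
> corresponding remote shell is set up.)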
>
> On both clusters Molpro has been installed the same way (/usr/local/molpro, NFS mounted on
> all nodes) and pretty much the same TORQUE script is used.
>
> On both clusters, I start Molpro using the following command in my TORQUE script:
>
> time /usr/local/molpro/molpro -m 64M -o $ofile -d $SCR -N $TASKLIST $ifile
>
> where $TASKLIST is defined by the TORQUE script and, in the case of the latest failed job
> on the AMD cluster, had the following value:
> TASKLIST = viz01:6,viz02:6,viz03:6,viz04:6,viz05:6,viz06:6,viz07:6,viz08:6,viz09:6,viz10:6
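> (For reference, one minimal way to build this list in the TORQUE script, assuming 6 cores per
> node and using TORQUE's $PBS_NODEFILE, would be something like
>
>   TASKLIST=`sort -u $PBS_NODEFILE | awk '{printf "%s%s:6", s, $1; s=","}'`
>
> which yields the comma-separated host:ncores list shown above.)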
>
> In the temp directory, the file molpro_options.31159 contained:
>   -m 64M -o test.out -d /tmp/16.wizard.cs.mtsu.edu test.com
> while the file procgrp.31159 was as follows:
> avolkov viz01 1 /usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe /data1/avolkov/benchmarks/molpro/bul
> ......
> avolkov viz02 1 /usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe /data1/avolkov/benchmarks/molpro/bul
> .....
> .....
> avolkov viz10 1 /usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe /data1/avolkov/benchmarks/molpro/bul
>
> BTW, the test.out file is never created...
>
> Contents of test.com file:
> ! $Revision: 2006.3 $
> ***,bullvalene                  !A title
> memory,64,M                     ! 1 MW = 8 MB
> basis=cc-pVTZ
> geomtyp=xyz
> geometry={
> 20           ! number of atoms
> this is where you put your title
>   C          1.36577619     -0.62495122     -0.63870960
>   C          0.20245537     -1.27584792     -1.26208804
>   C         -1.09275642     -1.01415419     -1.01302123
> .........
> }
> ks,b3lyp
>
> What am I doing wrong here?
>
> Thank you in advance for your help!
>
> Best Regards,
> Anatoliy
> ---------------------------
> Anatoliy Volkov, Ph.D.
> Associate Professor
> Department of Chemistry
> Middle Tennessee State University
>
>
>
> _______________________________________________
> Molpro-user mailing list
> Molpro-user at molpro.net
> http://www.molpro.net/mailman/listinfo/molpro-user
>





