[molpro-user] problems running molpro-mpp-2010.1-24.Linux_x86_64 on AMD-based cluster

Andy May MayAJ1 at cardiff.ac.uk
Thu Jun 21 23:39:23 BST 2012


Anatoliy,

Molpro's default behaviour is to use 8-byte integers on 64-bit machines, 
and 4-byte integers on 32-bit machines. You are correct that by default 
gfortran (and almost all compilers) gives 4-byte integers on 64-bit 
machines, but if you look at FFLAGS in CONFIG you will see that we set:

-fdefault-integer-8

and the equivalent for other compilers, so Molpro is using 8-byte integers.
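
If you want to double-check what your gfortran is doing, here is a quick
sketch (file and program names are arbitrary) that compiles a tiny Fortran
program with and without the flag and prints the default integer size:

cat > intsize.f90 <<'EOF'
program intsize
  integer :: i
  ! bit_size returns the number of bits in a default integer
  print *, 'default integer size in bytes:', bit_size(i)/8
end program intsize
EOF
gfortran intsize.f90 -o intsize && ./intsize                       # prints 4
gfortran -fdefault-integer-8 intsize.f90 -o intsize && ./intsize   # prints 8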

Can you see if adding scfpr2.F to OPT2 in CONFIG, i.e.:

OPT2=scfpr2.F

and then:

rm -f lib/libmolpro.a src/scf/scfpr2.o
make

allows the testjobs to run?
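
Putting it together, the whole cycle would look roughly like this (the OPT2
list should simply make the build compile that one file at a lower, -O2,
optimisation level):

# edit CONFIG so that it contains the line  OPT2=scfpr2.F  , then:
rm -f lib/libmolpro.a src/scf/scfpr2.o
make
# and rerun the tests as before, e.g.
make MOLPRO_OPTIONS=-n4 test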

Best wishes,

Andy

On 21/06/12 13:24, Anatoliy Volkov wrote:
> Dear Andy,
>
> Please find attached my CONFIG file.
>
> I thought that the gfortran64_int64 version of ACML is needed when the -i8 option is used, that
> is, when default integers are 8 bytes. I may be wrong, but I always thought that by default,
> 64-bit gfortran under x86_64 Linux uses 4-byte integers.
>
> BTW, only the following test jobs completed with errors:
> coreocc, Cs_DKH10, Cs_DKH2, Cs_DKH2_standard, Cs_DKH3, Cs_DKH4, Cs_DKH7, Cs_DKH8.
> All others seem to be fine.
>
> Thank you again for your help.
>
> Best Regards,
> Anatoliy
>
> ________________________________________
> From: mayaj1 at Cardiff.ac.uk [mayaj1 at Cardiff.ac.uk]
> Sent: Thursday, June 21, 2012 3:59 AM
> To: Anatoliy Volkov
> Cc: molpro-user at molpro.net
> Subject: Re: [molpro-user] problems running molpro-mpp-2010.1-24.Linux_x86_64 on AMD-based cluster
>
> Anatoliy,
>
> It could be that gfortran has been too aggressive when optimizing some
> file. If you can send me the CONFIG file then I'll try to reproduce with
> exactly the same tools.
>
> I think that -blaspath should be:
>
> /usr/local/acml5.1.0/gfortran64_int64/lib
>
> i.e. the 64-bit integer version of the acml library.
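>
> With the configure options you used before, that would be something like:
>
> ./configure -gcc -gfortran -openmpi -mpp -mppbase /usr/local/openmpi_gcc46/include -blas -blaspath /usr/local/acml5.1.0/gfortran64_int64/lib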
>
> Best wishes,
>
> Andy
>
> On 20/06/12 17:36, Anatoliy Volkov wrote:
>> Dear Andy,
>>
>> Thank you very much for your prompt reply. I have managed to compile Molpro 2010.1.25
>> using the following options:
>> ./configure -gcc -gfortran -openmpi -mpp -mppbase /usr/local/openmpi_gcc46/include -blas -blaspath /usr/local/acml5.1.0/gfortran64/lib
>> I believe tuning went well (no errors reported), but when running tests on 4 processors:
>> make MOLPRO_OPTIONS=-n4 test
>> I have gotten errors for the following tests: Cs_DKH10, Cs_DKH2, Cs_DKH2_standard,
>> Cs_DKH3, Cs_DKH4, Cs_DKH7, Cs_DKH8. I believe the error is the same and comes from
>> MPI:
>>
>> Running job Cs_DKH10.test
>> [wizard:11602] *** An error occurred in MPI_Allreduce
>> [wizard:11602] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
>> [wizard:11602] *** MPI_ERR_TRUNCATE: message truncated
>> [wizard:11602] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>>
>> Running job Cs_DKH2.test
>> [wizard:11720] *** An error occurred in MPI_Allreduce
>> [wizard:11720] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
>> [wizard:11720] *** MPI_ERR_TRUNCATE: message truncated
>> [wizard:11720] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>>
>> .....
>> .....
>>
>> Running job Cs_DKH8.test
>> [wizard:12281] *** An error occurred in MPI_Allreduce
>> [wizard:12281] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
>> [wizard:12281] *** MPI_ERR_TRUNCATE: message truncated
>> [wizard:12281] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>>
>>
>> Interestingly enough, there are no errors for the other tests I have run so far:
>> Running job Cs_nr.test
>> Running job allene_opt.test
>> Running job allyl_cipt2.test
>> Running job allyl_ls.test
>> Running job ar2_dk_dummy.test
>> Running job au2o_optdftecp1.test
>> Running job au2o_optdftecp2.test
>> Running job au2o_optecp.test
>> Running job aucs4k2.test
>> Running job b_cidft.test
>> Running job basisinput.test
>> Running job bccd_opt.test
>> Running job bccd_save.test
>> Running job benz_nlmo.test
>> Running job benzol_giao.test
>> Running job big_lattice.test
>> Running job br2_f12_multgem.test
>> Running job c2f4_cosmo.test
>> Running job c2h2_dfmp2.test
>> Running job c2h4_c1_freq.test
>> Running job c2h4_ccsd-f12.test
>> Running job c2h4_ccsdfreq.test
>> Running job c2h4_cosmo.test
>> Running job c2h4_cosmo_direct.test
>> Running job c2h4_d2.test
>> Running job c2h4_d2h.test
>> Running job c2h4_d2h_freq.test
>> Running job c2h4_ksfreq.test
>> Running job c2h4_lccsd.test
>> Running job c2h4_lccsd2.test
>> Running job c2h4_lccsd3.test
>> Running job c2h4_lmp2.test
>> Running job c2h4_optnum.test
>> Running job c2h4_prop.test
>> Running job c2h4o_cosmo.test
>> Running job c6h6_freq.test
>> Running job c6h6_freq_restart.test
>> Running job c6h6_opt.test
>>
>> Do you have any suggestions on how to fix the MPI error? Should I worry about it at all?
>>
>> Thank you in advance for your help.
>>
>> Best Regards,
>> Anatoliy
>>
>> ________________________________________
>> From: mayaj1 at Cardiff.ac.uk [mayaj1 at Cardiff.ac.uk]
>> Sent: Wednesday, June 20, 2012 4:21 AM
>> To: Anatoliy Volkov
>> Cc: molpro-user at molpro.net
>> Subject: Re: [molpro-user] problems running molpro-mpp-2010.1-24.Linux_x86_64 on AMD-based cluster
>>
>> Anatoliy,
>>
>> Yes, there appears to be some problem when running the binaries on
>> openSUSE 12.1. It is not simply a binary incompatibility problem; I see
>> the same issue building from source code on 12.1. Apparently, building
>> with pure TCGMSG produces an executable which crashes upon the first
>> call to a Global Arrays routine. We have this bug reported:
>>
>> https://www.molpro.net/bugzilla/show_bug.cgi?id=3712
>>
>> and are looking into a fix.
>>
>> I see that you have access to the source code, so you can easily build a
>> TCGMSG-MPI or MPI2 version from source which should work fine. Please
>> let us know if you have any problems building.
>>
>> Just for information, rsh should be the default for the binaries, but it
>> can be changed by setting the TCGRSH environment variable (or passing the
>> --tcgssh option to the bin/molpro shell script).
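>>
>> For example, in the job script (the rsh/ssh paths are only illustrative):
>>
>> export TCGRSH=/usr/bin/rsh    # keep rsh explicitly (the default)
>> # or
>> export TCGRSH=/usr/bin/ssh    # switch to ssh instead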
>>
>> Best wishes,
>>
>> Andy
>>
>> On 19/06/12 18:16, Anatoliy Volkov wrote:
>>> Greetings,
>>>
>>> I seem to have hit a wall trying to get molpro-mpp-2010.1-24.Linux_x86_64 to run
>>> on my AMD-based cluster (16 nodes, 6-core Phenom II X6 1090T or FX-6100 CPUs, and
>>> 16 GB RAM per node, OpenSUSE 12.1 x86-64, kernel 3.1.10-1.9-desktop), while there are
>>> absolutely no issues running the same version of Molpro on my old Intel-based cluster
>>> (dual-socket quad-core Xeon E5230 CPUs, 16 GB RAM per node, OpenSUSE 11.4 x86-64,
>>> kernel 2.6.37.6-0.11-desktop).
>>>
>>> On the AMD cluster, when Molpro starts to run on the master node, it tries to allocate a lot of memory,
>>> and then dies. I have taken a couple of snapshots of 'top' (see the attached top.log file).
>>>
>>> At first it tries to allocate 9 GB, then 19 GB, then 25 GB, etc., and then it dies with the
>>> following error in the TORQUE log file:
>>>
>>> Running Molpro
>>> tmp = /home/avolkov/pdir//usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe.p
>>>
>>>     Creating: host=viz01, user=avolkov,
>>>               file=/usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe, port=34803
>>>
>>>     60: ListenAndAccept: timeout waiting for connection 0 (0).
>>>
>>> 0.008u 0.133s 3:01.86 0.0%    0+0k 9344+40io 118pf+0w
>>>
>>> I am not sure I understand what is happening here. My cluster uses passwordless rsh and
>>> I have not noticed any issues with communication between nodes. At least my own code, which
>>> I compile using rsh-enabled OpenMPI, runs just fine on the cluster. Could it be that this version of
>>> Molpro tries to use ssh? But then I do not understand why it works on my Intel cluster, where only
>>> rsh is available...
>>>
>>> On both clusters Molpro has been installed the same way (/usr/local/molpro, NFS mounted on
>>> all nodes) and pretty much the same TORQUE script is used.
>>>
>>> On both clusters, I start Molpro using the following command in my TORQUE script:
>>>
>>> time /usr/local/molpro/molpro -m 64M -o $ofile -d $SCR -N $TASKLIST $ifile
>>>
>>> where $TASKLIST is defined by the TORQUE script and, in the case of the latest failed job
>>> on the AMD cluster, had the following value:
>>> TASKLIST = viz01:6,viz02:6,viz03:6,viz04:6,viz05:6,viz06:6,viz07:6,viz08:6,viz09:6,viz10:6
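>>>
>>> The script builds this list from $PBS_NODEFILE; roughly (simplified, not the exact code):
>>>
>>> # one host:ncores entry per node, counted from the TORQUE node file
>>> TASKLIST=$(sort $PBS_NODEFILE | uniq -c | awk '{printf "%s%s:%s", sep, $2, $1; sep=","}')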
>>>
>>> In the temp directory, the file molpro_options.31159 contained:
>>>     -m 64M -o test.out -d /tmp/16.wizard.cs.mtsu.edu test.com
>>> while the file procgrp.31159 was as follows:
>>> avolkov viz01 1 /usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe /data1/avolkov/benchmarks/molpro/bul
>>> ......
>>> avolkov viz02 1 /usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe /data1/avolkov/benchmarks/molpro/bul
>>> .....
>>> .....
>>> avolkov viz10 1 /usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe /data1/avolkov/benchmarks/molpro/bul
>>>
>>> BTW, the test.out file is never created...
>>>
>>> Contents of test.com file:
>>> ! $Revision: 2006.3 $
>>> ***,bullvalene                  !A title
>>> memory,64,M                     ! 1 MW = 8 MB
>>> basis=cc-pVTZ
>>> geomtyp=xyz
>>> geometry={
>>> 20           ! number of atoms
>>> this is where you put your title
>>>     C          1.36577619     -0.62495122     -0.63870960
>>>     C          0.20245537     -1.27584792     -1.26208804
>>>     C         -1.09275642     -1.01415419     -1.01302123
>>> .........
>>> }
>>> ks,b3lyp
>>>
>>> What am I doing wrong here?
>>>
>>> Thank you in advance for your help!
>>>
>>> Best Regards,
>>> Anatoliy
>>> ---------------------------
>>> Anatoliy Volkov, Ph.D.
>>> Associate Professor
>>> Department of Chemistry
>>> Middle Tennessee State University
>>>