[molpro-user] problems running molpro-mpp-2010.1-24.Linux_x86_64 on AMD-based cluster

Andy May MayAJ1 at cardiff.ac.uk
Fri Jun 29 13:51:49 BST 2012


Anatoliy,

I think that it would be best to re-link with the 8-byte integer library 
since then everything is consistent. You can do this most simply by 
rerunning configure with a new blaspath. Since configure remembers old 
options, all you need is:

./configure -batch -blaspath /path/to/acml5.1.0/gfortran64_int64/lib

The fact that it works with the 4-byte integer library I think relates to 
the machine being little-endian, but I'm not certain about this. It is 
possible to safely use a 4-byte integer BLAS library with Molpro using 
8-byte integers; it just involves making sure the macros BLAS_INT and 
LAPACK_INT are set correctly in CONFIG. But I would recommend using the 
same integer size everywhere when possible.
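
If you do want to stay with the 4-byte ACML, those settings are easy to 
inspect. Just as a sketch (the exact values below are an assumption on my 
part; compare with what configure actually wrote to your CONFIG):

grep -E 'BLAS_INT|LAPACK_INT' CONFIG
# with a 4-byte integer BLAS one would expect something like
#   BLAS_INT=4
#   LAPACK_INT=4
# whereas the 8-byte integer ACML would go together with BLAS_INT=8 / LAPACK_INT=8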

Best wishes,

Andy

On 22/06/12 16:12, Anatoliy Volkov wrote:
> Dear Andy,
>
> Thanks! The OPT2=scfpr2.F solution worked like a charm.
> The new Molpro executable passes all tests.
>
> Going back to 8-byte integers, I somehow missed the -fdefault-integer-8 option in
> the CONFIG file. Now I do not quite understand why linking with acml5.1.0/gfortran64/lib works...
> Do you think I should try to re-link the executable with acml5.1.0/gfortran64_int64/lib?
> Would that give me better performance?
>
> Thank you again for your help!
>
> Best Regards,
> Anatoliy
>
> ________________________________________
> From: mayaj1 at Cardiff.ac.uk [mayaj1 at Cardiff.ac.uk]
> Sent: Thursday, June 21, 2012 5:39 PM
> To: Anatoliy Volkov
> Cc: molpro-user at molpro.net
> Subject: Re: [molpro-user] problems running molpro-mpp-2010.1-24.Linux_x86_64 on AMD-based cluster
>
> Anatoliy,
>
> Molpro's default behaviour is to use 8-byte integers on 64-bit machines,
> and 4-byte integers on 32-bit machines. You are correct that by default
> gfortran (and almost all compilers) gives 4-byte integers on 64-bit
> machines, but if you look at FFLAGS in CONFIG you will see that we set:
>
> -fdefault-integer-8
>
> and the equivalent for other compilers, so Molpro is using 8-byte integers.
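>
> If you want to convince yourself of what the flag does, a quick throw-away
> test (just a sketch) is:
>
> cat > intsize.f90 <<'EOF'
> program intsize
>   integer :: i
>   print *, 'default integer kind:', kind(i)
> end program intsize
> EOF
> gfortran -fdefault-integer-8 intsize.f90 -o intsize && ./intsize   # should print 8
> gfortran intsize.f90 -o intsize && ./intsize                       # should print 4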
>
> Can you see if adding scfpr2.F to OPT2 in CONFIG, i.e.:
>
> OPT2=scfpr2.F
>
> and then:
>
> rm -f lib/libmolpro.a src/scf/scfpr2.o
> make
>
> allows the testjobs to run?
>
> Best wishes,
>
> Andy
>
> On 21/06/12 13:24, Anatoliy Volkov wrote:
>> Dear Andy,
>>
>> Please find attached my CONFIG file.
>>
>> I thought that the gfortran64_int64 version of ACML is needed when the -i8 option is used, that
>> is, when default integers are 8 bytes. I may be wrong, but I always thought that by default,
>> 64-bit gfortran under x86_64 Linux uses 4-byte integers.
>>
>> BTW, only the following test jobs completed with errors:
>> coreocc, Cs_DKH10, Cs_DKH2, Cs_DKH2_standard, Cs_DKH3, Cs_DKH4, Cs_DKH7, Cs_DKH8.
>> All others seem to be fine.
>>
>> Thank you again for your help.
>>
>> Best Regards,
>> Anatoliy
>>
>> ________________________________________
>> From: mayaj1 at Cardiff.ac.uk [mayaj1 at Cardiff.ac.uk]
>> Sent: Thursday, June 21, 2012 3:59 AM
>> To: Anatoliy Volkov
>> Cc: molpro-user at molpro.net
>> Subject: Re: [molpro-user] problems running molpro-mpp-2010.1-24.Linux_x86_64 on AMD-based cluster
>>
>> Anatoliy,
>>
>> It could be that gfortran has been too aggressive when optimizing some
>> file. If you can send me the CONFIG file then I'll try to reproduce with
>> exactly the same tools.
>>
>> I think that -blaspath should be:
>>
>> /usr/local/acml5.1.0/gfortran64_int64/lib
>>
>> i.e. the 64-bit integer version of the acml library.
>>
>> Best wishes,
>>
>> Andy
>>
>> On 20/06/12 17:36, Anatoliy Volkov wrote:
>>> Dear Andy,
>>>
>>> Thank you very much for your prompt reply. I have managed to compile Molpro 2010.1.25
>>> using the following options:
>>> ./configure -gcc -gfortran -openmpi -mpp -mppbase /usr/local/openmpi_gcc46/include -blas -blaspath /usr/local/acml5.1.0/gfortran64/lib
>>> I believe the tuning went well (no errors were reported), but when running the tests on 4 processors:
>>> make MOLPRO_OPTIONS=-n4 test
>>> I have gotten errors for the following tests: Cs_DKH10, Cs_DKH2, Cs_DKH2_standard,
>>> Cs_DKH3, Cs_DKH4, Cs_DKH7, Cs_DKH8. I believe the error is the same in each case and comes
>>> from MPI:
>>>
>>> Running job Cs_DKH10.test
>>> [wizard:11602] *** An error occurred in MPI_Allreduce
>>> [wizard:11602] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
>>> [wizard:11602] *** MPI_ERR_TRUNCATE: message truncated
>>> [wizard:11602] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>>>
>>> Running job Cs_DKH2.test
>>> [wizard:11720] *** An error occurred in MPI_Allreduce
>>> [wizard:11720] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
>>> [wizard:11720] *** MPI_ERR_TRUNCATE: message truncated
>>> [wizard:11720] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>>>
>>> .....
>>> .....
>>>
>>> Running job Cs_DKH8.test
>>> [wizard:12281] *** An error occurred in MPI_Allreduce
>>> [wizard:12281] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
>>> [wizard:12281] *** MPI_ERR_TRUNCATE: message truncated
>>> [wizard:12281] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>>>
>>>
>>> Interestingly enough, there are no errors for the other tests I have run so far:
>>> Running job Cs_nr.test
>>> Running job allene_opt.test
>>> Running job allyl_cipt2.test
>>> Running job allyl_ls.test
>>> Running job ar2_dk_dummy.test
>>> Running job au2o_optdftecp1.test
>>> Running job au2o_optdftecp2.test
>>> Running job au2o_optecp.test
>>> Running job aucs4k2.test
>>> Running job b_cidft.test
>>> Running job basisinput.test
>>> Running job bccd_opt.test
>>> Running job bccd_save.test
>>> Running job benz_nlmo.test
>>> Running job benzol_giao.test
>>> Running job big_lattice.test
>>> Running job br2_f12_multgem.test
>>> Running job c2f4_cosmo.test
>>> Running job c2h2_dfmp2.test
>>> Running job c2h4_c1_freq.test
>>> Running job c2h4_ccsd-f12.test
>>> Running job c2h4_ccsdfreq.test
>>> Running job c2h4_cosmo.test
>>> Running job c2h4_cosmo_direct.test
>>> Running job c2h4_d2.test
>>> Running job c2h4_d2h.test
>>> Running job c2h4_d2h_freq.test
>>> Running job c2h4_ksfreq.test
>>> Running job c2h4_lccsd.test
>>> Running job c2h4_lccsd2.test
>>> Running job c2h4_lccsd3.test
>>> Running job c2h4_lmp2.test
>>> Running job c2h4_optnum.test
>>> Running job c2h4_prop.test
>>> Running job c2h4o_cosmo.test
>>> Running job c6h6_freq.test
>>> Running job c6h6_freq_restart.test
>>> Running job c6h6_opt.test
>>>
>>> Do you have any suggestions on how to fix the MPI error? Should I worry about it at all?
>>>
>>> Thank you in advance for your help.
>>>
>>> Best Regards,
>>> Anatoliy
>>>
>>> ________________________________________
>>> From: mayaj1 at Cardiff.ac.uk [mayaj1 at Cardiff.ac.uk]
>>> Sent: Wednesday, June 20, 2012 4:21 AM
>>> To: Anatoliy Volkov
>>> Cc: molpro-user at molpro.net
>>> Subject: Re: [molpro-user] problems running molpro-mpp-2010.1-24.Linux_x86_64 on AMD-based cluster
>>>
>>> Anatoliy,
>>>
>>> Yes, there appears to be some problem when running the binaries on
>>> openSUSE 12.1. It is not simply a binary incompatibility problem; I see
>>> the same issue building from source code on 12.1. Apparently, building
>>> with pure TCGMSG produces an executable which crashes upon the first
>>> call to a global arrays routine. We have this bug reported:
>>>
>>> https://www.molpro.net/bugzilla/show_bug.cgi?id=3712
>>>
>>> and are looking into a fix.
>>>
>>> I see that you have access to the source code, so you can easily build a
>>> TCGMSG-MPI or MPI2 version from source which should work fine. Please
>>> let us know if you have any problems building.
>>>
>>> Just for information, rsh should be the default for the binaries, but it
>>> can be changed by setting the TCGRSH environment variable (or by passing
>>> the --tcgssh option to the bin/molpro shell script).
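>>>
>>> For example, in the job script one could do something along these lines
>>> (the rsh path is only an assumption; adjust it to your system):
>>>
>>> export TCGRSH=/usr/bin/rsh    # or e.g. /usr/bin/ssh to switch to ssh
>>>
>>> before invoking bin/molpro.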
>>>
>>> Best wishes,
>>>
>>> Andy
>>>
>>> On 19/06/12 18:16, Anatoliy Volkov wrote:
>>>> Greetings,
>>>>
>>>> I seem to have hit a wall trying to get molpro-mpp-2010.1-24.Linux_x86_64 to run
>>>> on my AMD-based cluster (16 nodes, 6-core Phenom II X6 1090T or FX-6100 CPUs,
>>>> 16 GB RAM per node, OpenSUSE 12.1 x86-64, kernel 3.1.10-1.9-desktop), while there are
>>>> absolutely no issues running the same version of Molpro on my old Intel-based cluster
>>>> (dual-socket quad-core Xeon E5230 CPUs, 16 GB RAM per node, OpenSUSE 11.4 x86-64,
>>>> kernel 2.6.37.6-0.11-desktop).
>>>>
>>>> On the AMD cluster, when Molpro starts to run on the master node, it tries to allocate a lot of memory,
>>>> and then dies. I have taken a couple of snapshots of 'top' (see the attached top.log file).
>>>>
>>>> At first it tries to allocate 9 GB, then 19 GB, then 25 GB, etc., and then it dies with the
>>>> following error in the TORQUE log file:
>>>>
>>>> Running Molpro
>>>> tmp = /home/avolkov/pdir//usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe.p
>>>>
>>>>      Creating: host=viz01, user=avolkov,
>>>>                file=/usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe, port=34803
>>>>
>>>>      60: ListenAndAccept: timeout waiting for connection 0 (0).
>>>>
>>>> 0.008u 0.133s 3:01.86 0.0%    0+0k 9344+40io 118pf+0w
>>>>
>>>> I am not sure I understand what is happening here. My cluster uses passwordless rsh and
>>>> I have not noticed any issues with communication between nodes. At least my own code that
>>>> I compile using rsh-enabled OpenMPI runs just fine on the cluster. Could it be that this version of
>>>> Molpro tries to use ssh? But then I do not understand why it works on my Intel cluster, where only
>>>> rsh is available...
>>>>
>>>> On both clusters Molpro has been installed the same way (/usr/local/molpro, NFS mounted on
>>>> all nodes) and pretty much the same TORQUE script is used.
>>>>
>>>> On both clusters, I start Molpro using the following command in my TORQUE script:
>>>>
>>>> time /usr/local/molpro/molpro -m 64M -o $ofile -d $SCR -N $TASKLIST $ifile
>>>>
>>>> where $TASKLIST is defined by the TORQUE script and, in the case of the latest failed job
>>>> on the AMD cluster, had the following value:
>>>> TASKLIST = viz01:6,viz02:6,viz03:6,viz04:6,viz05:6,viz06:6,viz07:6,viz08:6,viz09:6,viz10:6
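>>>>
>>>> For context, such a list can be built from the TORQUE node file along these
>>>> lines (a generic sketch only, not the exact code of my script):
>>>>
>>>> # $PBS_NODEFILE lists one hostname per allocated core; turn it into host:count pairs
>>>> TASKLIST=$(sort $PBS_NODEFILE | uniq -c | awk '{printf "%s%s:%s", sep, $2, $1; sep=","}')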
>>>>
>>>> In the temp directory, the file molpro_options.31159 contained:
>>>>      -m 64M -o test.out -d /tmp/16.wizard.cs.mtsu.edu test.com
>>>> while the file procgrp.31159 was as follows:
>>>> avolkov viz01 1 /usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe /data1/avolkov/benchmarks/molpro/bul
>>>> ......
>>>> avolkov viz02 1 /usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe /data1/avolkov/benchmarks/molpro/bul
>>>> .....
>>>> .....
>>>> avolkov viz10 1 /usr/local/molpro/molprop_2010_1_Linux_x86_64_i8/bin/molpro.exe /data1/avolkov/benchmarks/molpro/bul
>>>>
>>>> BTW, the test.out file is never created...
>>>>
>>>> Contents of the test.com file:
>>>> ! $Revision: 2006.3 $
>>>> ***,bullvalene                  !A title
>>>> memory,64,M                     ! 1 MW = 8 MB
>>>> basis=cc-pVTZ
>>>> geomtyp=xyz
>>>> geometry={
>>>> 20           ! number of atoms
>>>> this is where you put your title
>>>>      C          1.36577619     -0.62495122     -0.63870960
>>>>      C          0.20245537     -1.27584792     -1.26208804
>>>>      C         -1.09275642     -1.01415419     -1.01302123
>>>> .........
>>>> }
>>>> ks,b3lyp
>>>>
>>>> What am I doing wrong here?
>>>>
>>>> Thank you in advance for your help!
>>>>
>>>> Best Regards,
>>>> Anatoliy
>>>> ---------------------------
>>>> Anatoliy Volkov, Ph.D.
>>>> Associate Professor
>>>> Department of Chemistry
>>>> Middle Tennessee State University
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Molpro-user mailing list
>>>> Molpro-user at molpro.net
>>>> http://www.molpro.net/mailman/listinfo/molpro-user
>>>>
>>>
>>>
>>>
>>
>>
>
>
>



