[molpro-user] running parallel GA Molpro from under Torque?

Grigory Shamov Grigory.Shamov at umanitoba.ca
Wed Nov 27 18:07:21 GMT 2013


Dear Andy,

Thanks for the reply! I've built GA 5.2 (the current release) manually and
tried to pass its location with -mppbase, but configure then fails with:

"unable to determine MPP type"

-- 
Grigory Shamov

HPC Analyst, Westgrid/Compute Canada
E2-588 EITC Building, University of Manitoba
(204) 474-9625





On 13-11-27 3:51 AM, "Andy May" <MayAJ1 at cardiff.ac.uk> wrote:

>Grigory,
>
>Firstly, I can only echo Jeff's comments about preferring MVAPICH2 over
>OpenMPI when using InfiniBand.
>
>Secondly, you need to decide which MPI library you really want to use.
>You have given both a '-mppbase' option and an '-auto-ga-openmpi'
>option, but these are mutually exclusive. The auto option will build
>everything, including a copy of OpenMPI; however, its use is best
>restricted to a single machine. If you want a GA/OpenMPI build for your
>cluster, I strongly recommend building GA against your system MPI
>library and then passing the GA build directory to Molpro via -mppbase.
>So, removing the options which are the default anyway, your configure
>would look like:
>
>./configure -mpp -mppbase /path/to/GA/build/dir -icc -ifort -slater
>-noboost
>
>and of course PATH should contain the MPI executables.
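>
>For example, something along these lines (the path here is only a
>placeholder for wherever your system OpenMPI is installed):
>
>export PATH=/path/to/system/openmpi/bin:$PATH
>
>so that the MPI compiler wrappers are picked up when GA and Molpro are
>built, and mpirun is picked up at run time.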
>
>Best wishes,
>
>Andy
>
>On 26/11/13 20:51, Grigory Shamov wrote:
>> Hi All,
>>
>> I was trying to install Molpro 12.1 for one of our users. We have an
>> InfiniBand Linux cluster with OpenMPI 1.6 and Torque; the latter has
>> cpusets enabled. I wanted to build Molpro from source, using our OpenMPI,
>> so that it would hopefully be Torque-aware, run within the cpusets
>> allocated for it, and so on.
>>
>> I have chosen the GA version, as it seems to depend less on a shared
>> filesystem. I've used the Intel 12.1 compilers and MKL. The configure
>> line was:
>>
>> ./configure -blas -lapack -mpp -mppbase
>> /global/software/openmpi-1.6.1-intel1/include -icc -ifort -x86_64 -i8
>> -nocuda -noopenmp -slater -auto-ga-openmpi -noboost
>>
>>
>> It all went really smoothly, and 'make test' on a single SMP node runs
>> successfully, as do Molpro batch jobs within a single node
>> (nodes=1:ppn=N); however, when I try to run across nodes, it fails
>> with some ARMCI errors. Here is my script:
>>
>>
>> #!/bin/bash
>> #PBS -l pmem=1gb,nodes=1:ppn=4
>>
>> cd $PBS_O_WORKDIR
>> echo "Current working directory is `pwd`"
>> echo "Starting run at: `date`"
>>
>> echo "TMPDIR is $TMPDIR"
>> export TMPDIR4=$TMPDIR
>>
>> #Add molpro directory to the path
>> export PATH=$PATH:$HOME/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/bin
>>
>> #Run molpro in parallel
>>
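>> # ARMCI_DEFAULT_SHMMAX is interpreted in megabytes (so 1800 MB here)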
>> export ARMCI_DEFAULT_SHMMAX=1800
>> molpro -v  -S ga auh_ecp_lib.com
>>
>> # all done.
>>
>>
>>
>> If I change "nodes=1:ppn=4" to, say, "procs=8", I get the following
>> output with the failure:
>>
>>
>> export AIXTHREAD_SCOPE='s'
>>   export MOLPRO_PREFIX='/home/abrown/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8'
>>   export MP_NODES='0'
>>   export MP_PROCS='8'
>>          MP_TASKS_PER_NODE=''
>>   export MOLPRO_NOARG='1'
>>   export MOLPRO_OPTIONS=' -v -S ga auh_ecp_lib.com'
>>   export MOLPRO_OPTIONS_FILE='/scratch/6942419.yak.local/molpro_options.12505'
>>          MPI_MAX_CLUSTER_SIZE=''
>>          MV2_ENABLE_AFFINITY=''
>>   export RT_GRQ='ON'
>>          TCGRSH=''
>>   export TMPDIR='/scratch/6942419.yak.local'
>>   export XLSMPOPTS='parthds=1'
>>
>> /home/user/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/src/openmpi-install/bin/mpirun
>> --mca mpi_warn_on_fork 0 -machinefile /scratch/6942419.yak.local/procgrp.12505 -np 8
>> /home/user/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/bin/molpro.exe
>> -v -S ga auh_ecp_lib.com
>> -10005:Segmentation Violation error, status=: 11
>> (rank:-10005 hostname:n211 pid:29570):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigSegvHandler():310 cond:0
>> 5:Child process terminated prematurely, status=: 11
>> (rank:5 hostname:n211 pid:29560):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigChldHandler():178 cond:0
>> -10004:Segmentation Violation error, status=: 11
>> (rank:-10004 hostname:n227 pid:13769):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigSegvHandler():310 cond:0
>> -10003:Segmentation Violation error, status=: 11
>> (rank:-10003 hostname:n232 pid:15721):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigSegvHandler():310 cond:0
>> 3:Child process terminated prematurely, status=: 11
>> (rank:3 hostname:n232 pid:15717):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigChldHandler():178 cond:0
>> 4:Child process terminated prematurely, status=: 11
>> (rank:4 hostname:n227 pid:13765):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigChldHandler():178 cond:0
>> -10000:Segmentation Violation error, status=: 11
>> (rank:-10000 hostname:n242 pid:12666):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigSegvHandler():310 cond:0
>> 0:Child process terminated prematurely, status=: 11
>> (rank:0 hostname:n242 pid:12652):ARMCI DASSERT fail.
>> src/common/signaltrap.c:SigChldHandler():178 cond:0
>>
>>
>> The stderr also has some ARMCI messages:
>>
>> ARMCI master: wait for child process (server) failed:: No child processes
>> 
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 3 in communicator MPI COMMUNICATOR 4 DUP FROM 0
>> with errorcode 11.
>>
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> ARMCI master: wait for child process (server) failed:: No child processes
>> ARMCI master: wait for child process (server) failed:: No child processes
>> ARMCI master: wait for child process (server) failed:: No child processes
>> forrtl: error (78): process killed (SIGTERM)
>> Image              PC                Routine            Line        Source
>> molpro.exe         000000000529533A  Unknown            Unknown     Unknown
>> molpro.exe         0000000005293E36  Unknown            Unknown     Unknown
>> molpro.exe         0000000005240270  Unknown            Unknown     Unknown
>> molpro.exe         00000000051D4F2E  Unknown            Unknown     Unknown
>> molpro.exe         00000000051DD5D3  Unknown            Unknown     Unknown
>> molpro.exe         0000000004F6C9ED  Unknown            Unknown     Unknown
>> molpro.exe         0000000004F4C367  Unknown            Unknown     Unknown
>> molpro.exe         0000000004F6C8AB  Unknown            Unknown     Unknown
>> libc.so.6          0000003B40632920  Unknown            Unknown     Unknown
>> libpthread.so.0    0000003B40A0C170  Unknown            Unknown     Unknown
>> libmlx4-m-rdmav2.  00002B0147A155FE  Unknown            Unknown     Unknown
>>
>>
>>
>>
>> Could you please suggest what I am doing wrong, and how to run the GA
>> version of Molpro correctly in parallel across nodes? Any suggestions
>> would be really appreciated! Thanks!
>>



