[molpro-user] running parallel GA Molpro from under Torque?

Andy May MayAJ1 at cardiff.ac.uk
Wed Nov 27 09:51:41 GMT 2013


Grigory,

Firstly, I can only echo Jeff's comments recommending MVAPICH2 over 
OpenMPI when using Infiniband.

Secondly, you need to decide which MPI library you really want to use. 
You have given both a '-mppbase' option and an '-auto-ga-openmpi' 
option, but these are mutually exclusive. The auto option will build 
everything, including a copy of OpenMPI; however, its use is best 
restricted to a single machine. If you want a GA/OpenMPI build for your 
cluster, I strongly recommend building GA against your system MPI 
library and then passing the GA build directory via -mppbase to Molpro. 
So, removing the options which are default anyway, your configure would 
look like:

./configure -mpp -mppbase /path/to/GA/build/dir -icc -ifort -slater -noboost

and of course PATH should contain the MPI executables.
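
For reference, the GA step might look roughly like the following. This 
is only a sketch: the module command is a placeholder for however you 
put your system MPI wrappers on PATH, and the MPICC/MPIF77 variable 
names should be checked against './configure --help' for your GA version.

# placeholder: make the system MPI compiler wrappers (mpicc/mpif90) available
module load openmpi/1.6

# build GA in its own directory against that MPI
cd /path/to/GA/build/dir
./configure CC=icc F77=ifort MPICC=mpicc MPIF77=mpif90
make

# then point the Molpro configure at that GA build
cd /path/to/molpro/source
./configure -mpp -mppbase /path/to/GA/build/dir -icc -ifort -slater -noboost

The important point is simply that GA ends up linked against the same 
MPI library that mpirun will use on your compute nodes.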

Best wishes,

Andy

On 26/11/13 20:51, Grigory Shamov wrote:
> Hi All,
>
> I was trying to install MolPro 12.1 for one of our users. We have an
> Infiniband Linux cluster with OpenMPI 1.6 and Torque; the latter has
> CPUsets enabled. I wanted to build MolPro from source, using our OpenMPI,
> so that hopefully it would be Torque-aware, run within the CPUsets
> allocated for it, etc.
>
> I have chosen the GA version, as it seems to depend less on a shared
> filesystem. I've used the Intel 12.1 compilers and MKL. The configure line was:
>
> ./configure -blas -lapack -mpp -mppbase
> /global/software/openmpi-1.6.1-intel1/include -icc -ifort -x86_64 -i8
> -nocuda -noopenmp -slater -auto-ga-openmpi -noboost
>
>
> It all went really smoothly, and 'make test' runs successfully on a
> single SMP node, as do MolPro batch jobs within a single node
> (nodes=1:ppn=N); however, when I try to run across nodes, it fails
> with some ARMCI errors. Here is my script:
>
>
> #!/bin/bash
> #PBS -l pmem=1gb,nodes=1:ppn=4
>
> cd $PBS_O_WORKDIR
> echo "Current working directory is `pwd`"
> echo "Starting run at: `date`"
>
> echo "TMPDIR is $TMPDIR"
> export TMPDIR4=$TMPDIR
>
> #Add molpro directory to the path
> export PATH=$PATH:$HOME/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/bin
>
> #Run molpro in parallel
>
> export ARMCI_DEFAULT_SHMMAX=1800
> molpro -v  -S ga auh_ecp_lib.com
>
> # all done.
>
>
>
> If I change "nodes=1:ppn=4" to, say, "procs=8", I get the following output
> with the failure:
>
>
> export AIXTHREAD_SCOPE='s'
>   export
> MOLPRO_PREFIX='/home/abrown/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8'
>   export MP_NODES='0'
>   export MP_PROCS='8'
>          MP_TASKS_PER_NODE=''
>   export MOLPRO_NOARG='1'
>   export MOLPRO_OPTIONS=' -v -S ga auh_ecp_lib.com'
>   export
> MOLPRO_OPTIONS_FILE='/scratch/6942419.yak.local/molpro_options.12505'
>          MPI_MAX_CLUSTER_SIZE=''
>          MV2_ENABLE_AFFINITY=''
>   export RT_GRQ='ON'
>          TCGRSH=''
>   export TMPDIR='/scratch/6942419.yak.local'
>   export XLSMPOPTS='parthds=1'
> /home/user/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/src/openmpi-instal
> l/bin/mpirun --mca mpi_warn_on_fork 0 -machinefile
> /scratch/6942419.yak.local/procgrp.12505 -np 8
> /home/user/molpro-12.1-ga/molprop_2012_1_Linux_x86_64_i8/bin/molpro.exe
> -v -S ga auh_ecp_lib.com
> -10005:Segmentation Violation error, status=: 11
> (rank:-10005 hostname:n211 pid:29570):ARMCI DASSERT fail.
> src/common/signaltrap.c:SigSegvHandler():310 cond:0
> 5:Child process terminated prematurely, status=: 11
> (rank:5 hostname:n211 pid:29560):ARMCI DASSERT fail.
> src/common/signaltrap.c:SigChldHandler():178 cond:0
> -10004:Segmentation Violation error, status=: 11
> (rank:-10004 hostname:n227 pid:13769):ARMCI DASSERT fail.
> src/common/signaltrap.c:SigSegvHandler():310 cond:0
> -10003:Segmentation Violation error, status=: 11
> (rank:-10003 hostname:n232 pid:15721):ARMCI DASSERT fail.
> src/common/signaltrap.c:SigSegvHandler():310 cond:0
> 3:Child process terminated prematurely, status=: 11
> (rank:3 hostname:n232 pid:15717):ARMCI DASSERT fail.
> src/common/signaltrap.c:SigChldHandler():178 cond:0
> 4:Child process terminated prematurely, status=: 11
> (rank:4 hostname:n227 pid:13765):ARMCI DASSERT fail.
> src/common/signaltrap.c:SigChldHandler():178 cond:0
> -10000:Segmentation Violation error, status=: 11
> (rank:-10000 hostname:n242 pid:12666):ARMCI DASSERT fail.
> src/common/signaltrap.c:SigSegvHandler():310 cond:0
> 0:Child process terminated prematurely, status=: 11
> (rank:0 hostname:n242 pid:12652):ARMCI DASSERT fail.
> src/common/signaltrap.c:SigChldHandler():178 cond:0
>
>
> The stderr also has some ARMCI messages:
>
> ARMCI master: wait for child process (server) failed:: No child processes
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 3 in communicator MPI COMMUNICATOR 4 DUP
> FROM 0
> with errorcode 11.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> ARMCI master: wait for child process (server) failed:: No child processes
> ARMCI master: wait for child process (server) failed:: No child processes
> ARMCI master: wait for child process (server) failed:: No child processes
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine            Line        Source
> molpro.exe         000000000529533A  Unknown               Unknown  Unknown
> molpro.exe         0000000005293E36  Unknown               Unknown  Unknown
> molpro.exe         0000000005240270  Unknown               Unknown  Unknown
> molpro.exe         00000000051D4F2E  Unknown               Unknown  Unknown
> molpro.exe         00000000051DD5D3  Unknown               Unknown  Unknown
> molpro.exe         0000000004F6C9ED  Unknown               Unknown  Unknown
> molpro.exe         0000000004F4C367  Unknown               Unknown  Unknown
> molpro.exe         0000000004F6C8AB  Unknown               Unknown  Unknown
> libc.so.6          0000003B40632920  Unknown               Unknown  Unknown
> libpthread.so.0    0000003B40A0C170  Unknown               Unknown  Unknown
> libmlx4-m-rdmav2.  00002B0147A155FE  Unknown               Unknown  Unknown
>
>
>
>
> Could you please suggest what I am doing wrong, and how to run the GA
> version of MolPro correctly in parallel across nodes? Any suggestions
> would be really appreciated! Thanks!
>


