[molpro-user] Non-reproducible stuck state when running Molpro on NFS drive

Gregory Magoon gmagoon at MIT.EDU
Fri Jul 15 19:17:42 BST 2011


Hi,
 I have successfully compiled Molpro (with Global Arrays/TCGMSG; MPICH2 from
the Ubuntu package) on one of the compute nodes of our new server and installed
it in an NFS directory on our head node. The initial tests on the compute node
ran fine, but since the installation I've had trouble running Molpro on the
compute nodes (it seems to work fine on the head node itself). Sometimes (sorry
I can't be more precise, but the failure does not seem to be reproducible) a
job on a compute node will get stuck in its early stages, generating a large
amount of NFS traffic (~14+ Mbps outbound to the head node and ~7 Mbps inbound
from it) and causing fairly high CPU usage by the nfsd processes on the head
node. The Molpro processes in the stuck state are shown in the "top" listings
at the bottom of this e-mail. I have also attached example verbose output for a
case that works and a case that gets stuck.
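In case it helps with diagnosis, this is the sort of thing I can run on a
stuck compute node to gather more data (these particular commands are
illustrative rather than the exact ones I ran; the PID is taken from the
"top" listing below):

  # per-operation NFS client statistics on the compute node
  nfsstat -c
  # show what a D-state hydra_pmi_proxy process is blocked on
  ps -o pid,stat,wchan:30,cmd -p 1483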

Some notes:
- /usr/local is mounted as a read-only NFS file system; /home is mounted as a
  read-write NFS file system (rough fstab entries are sketched after this list)
- Runs with fewer processors (e.g. 6) seem more likely to complete successfully
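For concreteness, the mounts on the compute nodes look roughly like the
following (the head-node hostname "head" and the exact option strings here are
illustrative, not copied from our fstab):

  head:/usr/local  /usr/local  nfs  ro,hard,intr  0 0
  head:/home       /home       nfs  rw,hard,intr  0 0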

I've tried several approaches to address the issue, including (1) mounting
/usr/local as a read-write file system and (2) changing the rsize and wsize
parameters of the NFS mounts, but neither has helped. We also tried
redirecting stdin from /dev/null when launching the process, which at first
appeared to help, but later tests suggested it made no real difference. Rough
versions of these commands are sketched below.
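The attempted workarounds were along these lines (the mount option values
shown are examples, not the exact ones tried; the molpro invocation matches
the one in the attached verbose output):

  # 1. remount /usr/local read-write on a compute node
  mount -o remount,rw /usr/local
  # 2. try different NFS transfer sizes for /home
  mount -o remount,rsize=32768,wsize=32768 /home
  # launch with stdin redirected from /dev/null
  molpro -v -d /tmp/user -m 250M test3.inp < /dev/null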

If anyone has tips or ideas to help diagnose the issue, they would be greatly
appreciated, and I'd be happy to provide any additional details that would
help describe the problem.

Thanks very much,
Greg

Top system processes in the "top" output during the stuck state:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   10 root      20   0     0    0    0 S   10  0.0   0:16.50 kworker/0:1
    2 root      20   0     0    0    0 S    6  0.0   0:10.86 kthreadd
 1496 root      20   0     0    0    0 S    1  0.0   0:04.73 kworker/0:2
    3 root      20   0     0    0    0 S    1  0.0   0:00.93 ksoftirqd/0

Processes owned by the user in the "top" output during the stuck state:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
29961 user      20   0 19452 1508 1072 R    0  0.0   0:00.05 top
 1176 user      20   0 91708 1824  868 S    0  0.0   0:00.01 sshd
 1177 user      20   0 24980 7620 1660 S    0  0.0   0:00.41 bash
 1289 user      20   0 91708 1824  868 S    0  0.0   0:00.00 sshd
 1290 user      20   0 24980 7600 1640 S    0  0.0   0:00.32 bash
 1386 user      20   0  4220  664  524 S    0  0.0   0:00.01 molpro
 1481 user      20   0 18764 1196  900 S    0  0.0   0:00.00 mpiexec
 1482 user      20   0 18828 1092  820 S    0  0.0   0:00.00 hydra_pmi_proxy
 1483 user      20   0 18860  488  212 D    0  0.0   0:00.00 hydra_pmi_proxy
 1484 user      20   0 18860  488  212 D    0  0.0   0:00.00 hydra_pmi_proxy
 1485 user      20   0 18860  488  212 D    0  0.0   0:00.00 hydra_pmi_proxy
 1486 user      20   0 18860  488  212 D    0  0.0   0:00.00 hydra_pmi_proxy
 1487 user      20   0 18860  488  212 D    0  0.0   0:00.00 hydra_pmi_proxy
 1488 user      20   0 18860  488  212 D    0  0.0   0:00.00 hydra_pmi_proxy
 1489 user      20   0 18860  488  212 D    0  0.0   0:00.00 hydra_pmi_proxy
 1490 user      20   0 18860  488  208 D    0  0.0   0:00.00 hydra_pmi_proxy
 1491 user      20   0 18860  488  208 D    0  0.0   0:00.00 hydra_pmi_proxy
 1492 user      20   0 18860  488  208 D    0  0.0   0:00.00 hydra_pmi_proxy
 1493 user      20   0 18860  488  208 D    0  0.0   0:00.00 hydra_pmi_proxy
 1494 user      20   0 18860  492  212 D    0  0.0   0:00.00 hydra_pmi_proxy



-------------- next part --------------
 # PARALLEL mode
 nodelist=12
 first   =12
 second  =
 third   =
 HOSTFILE_FORMAT: $hostname

node02
node02
node02
node02
node02
node02
node02
node02
node02
node02
node02
node02

 export LD_LIBRARY_PATH='/opt/acml4.4.0/gfortran64_int64/lib:'
 export AIXTHREAD_SCOPE='s'
 export INSTLIB='/usr/local/molpro2010.1/lib/molprop_2010_1_Linux_x86_64_i8'
 export MP_NODES='0'
 export MP_PROCS='12'
        MP_TASKS_PER_NODE=''
 export MOLPRO_NOARG='1'
 export MOLPRO_OPTIONS=' -v -d /tmp/user -m 250M test3.inp'
 export MOLPRO_OPTIONS_FILE='/tmp/molpro_options.1109'
        MPI_MAX_CLUSTER_SIZE=''
 export PROCGRP='/tmp/procgrp.1109'
 export RT_GRQ='ON'
        TCGRSH=''
        TMPDIR=''
 export XLSMPOPTS='parthds=1'
/usr/bin/mpiexec -machinefile /tmp/procgrp.1109 -np 12 /usr/local/molpro2010.1/bin/molprop_2010_1_Linux_x86_64_i8.exe  -v -d /tmp/user -m 250M test3.inp
-------------- next part --------------
 # PARALLEL mode
 nodelist=12
 first   =12
 second  =
 third   =
 HOSTFILE_FORMAT: $hostname

node01
node01
node01
node01
node01
node01
node01
node01
node01
node01
node01
node01

 export LD_LIBRARY_PATH='/opt/acml4.4.0/gfortran64_int64/lib:'
 export AIXTHREAD_SCOPE='s'
 export INSTLIB='/usr/local/molpro2010.1/lib/molprop_2010_1_Linux_x86_64_i8'
 export MP_NODES='0'
 export MP_PROCS='12'
        MP_TASKS_PER_NODE=''
 export MOLPRO_NOARG='1'
 export MOLPRO_OPTIONS=' -v -d /tmp/user -m 250M test3.inp'
 export MOLPRO_OPTIONS_FILE='/tmp/molpro_options.1229'
        MPI_MAX_CLUSTER_SIZE=''
 export PROCGRP='/tmp/procgrp.1229'
 export RT_GRQ='ON'
        TCGRSH=''
        TMPDIR=''
 export XLSMPOPTS='parthds=1'
/usr/bin/mpiexec -machinefile /tmp/procgrp.1229 -np 12 /usr/local/molpro2010.1/bin/molprop_2010_1_Linux_x86_64_i8.exe  -v -d /tmp/user -m 250M test3.inp
 token read from /usr/local/molpro2010.1/lib/molprop_2010_1_Linux_x86_64_i8//.token
 token read from /home/user/.molpro/token
 input from /home/user/molprowd/test3.inp
 output to /home/user/molprowd/test3.out
 XML stream to /home/user/molprowd/test3.xml
 Creating directory /tmp/user

