# Running Molpro on parallel computers

Molpro will run on distributed-memory multiprocessor systems, including workstation clusters, under the control of the Global Arrays parallel toolkit or the MPI-2 library. There are also some parts of the code that can take advantage of shared memory parallelism through the OpenMP protocol, although these are somewhat limited, and this facility is not at present recommended. It should be noted that there remain some parts of the code that are not, or only partly, parallelized, and therefore run with replicated work. Additionally, some of those parts which have been parallelized rely on fast inter-node communications, and can be very inefficient across ordinary networks. Therefore some caution and experimentation is needed to avoid waste of resources in a multiuser environment.

Molpro effects interprocess cooperation through the ppidd library, which, depending on how it was configured and built, draws on either the Global Arrays parallel toolkit or pure MPI. ppidd is described in Comp. Phys. Commun. 180, 2673-2679 (2009).

Global Arrays (GA) handles distributed data objects using one-sided remote memory access facilities. Several GA implementation options are available, each with its own advantages and disadvantages. It is therefore very important to read and understand sections GA Installation notes and memory specifications before trying to run large-scale parallel calculations with Molpro.

In the case of the MPI implementation, there is a choice between using MPI-2 one-sided memory access and devoting some of the processes to act as data ‘helpers’. Performance is generally significantly better if at least one dedicated helper is used, and in some cases it is advisable to specify more. At the scalable limit, one core on each node of a typical multi-core cluster is devoted to a helper, but in most cases it is possible to manage with fewer, thereby making more cores available for computation. This aspect of configuration can be tuned through the *-helper-server options described below.

There are different ways to configure the GA library, which affect execution and options and are therefore important from a user's point of view. The current default of the GA installation is to use --with-mpi-ts, which is VERY SLOW! In our experience the best performance is achieved with --with-sockets for single workstations and servers without fast interconnects, or with --with-openib for nodes that are interconnected by InfiniBand. These models use communication interfaces provided by GA itself and do not need any helper processes. A serious disadvantage, however, is that with these models the GA software frequently crashes if several GAs are allocated whose total size exceeds 16 GB (2 GW). This can be avoided by allocating at the beginning of the job a single very large GA with at least the size of the total GA space needed in the calculation (which is not known in advance). This can be done with the -G or -M molpro options; see section memory specifications for details.

Alternatively, a communication model based on MPI (--with-mpi-pr) can be used for single- and multi-node calculations, including InfiniBand. This needs one helper process on each node. Thus, if for example -n40 is specified as a molpro option in a 2-node calculation, only 38 MPI processes run molpro, and 2 additional helper processes are started. For the same number of molpro MPI processes on a single node, mpi-pr is only very slightly (2-5%) slower than --with-sockets or --with-openib. However, we find multi-node calculations with --with-mpi-pr to be significantly slower than with --with-openib (this may be machine dependent). On the other hand, GA preallocation is not needed in this case, and therefore the mpi-pr model is easier to use.
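The process accounting for the mpi-pr model can be illustrated with a small sketch. The function below is not part of Molpro; it merely reproduces the bookkeeping described above, in which one process per node is reserved as a helper:

```python
# Illustrative sketch (not part of Molpro): with GA's --with-mpi-pr model,
# one MPI process per node acts as a helper, so of the -n processes only
# n_total - n_nodes actually run molpro computation.
def mpi_pr_compute_processes(n_total: int, n_nodes: int) -> int:
    """Number of MPI processes left for computation after one
    helper process has been reserved on each node."""
    if n_total <= n_nodes:
        raise ValueError("need more than one process per node")
    return n_total - n_nodes

# Example from the text: -n40 in a 2-node calculation leaves
# 38 processes running molpro, plus 2 helper processes.
print(mpi_pr_compute_processes(40, 2))  # -> 38
```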

We note that we have experienced problems in large calculations with --with-mpi-pr and recent Open MPI versions. In our experience, molpro calculations are most stable with the rather old openmpi_2.0.2 version. Another problem with the current GA/mpi-pr version is that even after successful calculations shared memory segments remain in /dev/shm. These can accumulate to the point that a machine becomes unusable, and should therefore be deleted from time to time. In addition, it may be necessary to set the environment variable ARMCI_DEFAULT_SHMMAX to a large value.

The following additional options for the molpro command may be used to specify and control parallel execution. In addition, appropriate memory specifications (-m, -M, -G) are important, see section memory specifications.

• -n $|$ --tasks tasks/tasks_per_node:smp_threads tasks specifies the number of parallel processes to be set up, and defaults to 1. tasks_per_node sets the number of GA (or MPI-2) processes to run on each node, where appropriate. The default is installation dependent. In some environments (e.g., IBM running under LoadLeveler; PBS batch jobs), the value given by -n is capped to the maximum allowed by the environment; in such circumstances it can be useful to give a very large number as the value for -n so that the number of processes is controlled by the batch job specification. smp_threads relates to the use of OpenMP shared-memory parallelism, and specifies the maximum number of OpenMP threads that will be opened; it defaults to 1. Any of these three components may be omitted, and appropriate combinations will allow GA (or MPI-2)-only, OpenMP-only, or mixed parallelism.
• -N $|$ --task-specification user1:node1:tasks1,user2:node2:tasks2$\dots$ node1, node2 etc. specify the host names of the nodes on which to run. On most parallel systems, node1 defaults to the local host name, and there is no default for node2 and higher. On Cray T3E and IBM SP systems, and on systems running under the PBS batch system, if -N is not specified, nodes are obtained from the system in the standard way. tasks1, tasks2 etc. may be used to control the number of tasks on each node as a more flexible alternative to -n / tasks_per_node. If omitted, they are each set equal to -n / tasks_per_node. user1, user2 etc. give the username under which processes are to be created. Most of these parameters may be omitted in favour of the usually sensible default values.
• -S $|$ --shared-file-implementation method specifies the method by which shared data are held in parallel. method can be sf or ga, and it is set automatically according to the properties of the scratch directories by default. If method is manually set to sf, please ensure that all scratch directories are shared by all processes. Note that for the GA version of Molpro, if method is set to sf manually or by default, the scratch directories cannot be located on NFS when running a molpro job on multiple nodes, because the SF facility in Global Arrays does not work well on multiple nodes with NFS. There is no such restriction for the MPI-2 version of Molpro.
• --multiple-helper-server nprocs_per_server enables multiple helper servers; nprocs_per_server sets how many processes share one helper server. For example, when the total number of processes is $32$ and $nprocs\_per\_server=8$, every $8$ processes (including the helper server) share one helper server, and there are $4$ helper servers in total. Any unreasonable value of $nprocs\_per\_server$ (i.e., any integer less than 2) will be reset to a very large number automatically, which is equivalent to option --single-helper-server.
• --node-helper-server specifies one helper server on every node if all the nodes are symmetric and have a reasonable number of processes (i.e., every node has the same number of processes, and that number is greater than 1); this is the default behaviour. Otherwise, a single helper server for all processes/nodes will be used, which is equivalent to option --single-helper-server.
• --single-helper-server specifies only one single helper server for all processes.
• --no-helper-server disables the helper server.
• -t $|$ --omp-num-threads n Specify the number of OpenMP threads, as if the environment variable OMP_NUM_THREADS were set to n.
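The helper-server bookkeeping behind --multiple-helper-server can be sketched with a small illustrative function (hypothetical helper, not Molpro code), reproducing the example of 32 processes with 8 per server:

```python
# Illustrative sketch (not part of Molpro): number of helper servers
# implied by --multiple-helper-server nprocs_per_server.
def n_helper_servers(total_procs: int, nprocs_per_server: int) -> int:
    """Each group of nprocs_per_server processes (helper included)
    shares one helper server; values below 2 fall back to a single
    helper server for all processes (--single-helper-server)."""
    if nprocs_per_server < 2:
        return 1  # equivalent to --single-helper-server
    return total_procs // nprocs_per_server

# Example from the text: 32 processes, 8 per server -> 4 helper servers.
print(n_helper_servers(32, 8))  # -> 4
```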

Note that options --multiple-helper-server, --node-helper-server,
--single-helper-server, and --no-helper-server are only effective for Molpro built with the MPI-2 library. When one or more helper servers are enabled, those processes act as data helper servers and the remaining processes are used for computation. Even so, performance is quite competitive when running with a large number of processes. When the helper server is disabled, all processes are used for computation; however, the performance may not be good because of the poor performance of some existing implementations of the MPI-2 standard for one-sided operations.

Large-scale parallel Molpro calculations may require a lot of GA space. This concerns in particular pno-lccsd calculations (cf. section local correlation methods with pair natural orbitals (PNOs)) and, to a lesser extent, also Hartree-Fock, DFT, and MCSCF/CASSCF calculations with density fitting. If GA/sockets or GA/openib is used, it may be necessary to preallocate as much GA space as possible (see section GA Installation notes). It is then necessary to share the available memory of the machine between the molpro stack memory (determined by -m) and the GA memory (determined by the -G option). Both are by default given in megawords (m), but the unit gigaword (g) can also be used (e.g., -m1000 is equivalent to -m1000m and to -m1g). The total memory $M$ per node allocated by molpro amounts to $(n \cdot m+G)/N$, where $n$ is the total number of processes (-n option), $m$ is the stack memory per process (-m option), $G$ the GA memory (-G option), and $N$ the number of nodes. In addition, at least 0.3 gw per process should be added for the program itself. In total, a calculation needs about $8\cdot[n\cdot(m+0.3)+G]/N$ GB (gigabytes) of memory ($n,m,G$ in gw), and this should not exceed the physical memory of the machine(s) used.
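As a sanity check, the memory estimate above can be evaluated with a short sketch (illustrative only; the 0.3 gw per-process overhead is the figure quoted in the text):

```python
# Illustrative sketch: total memory per node in GB for a Molpro job,
# following the estimate 8*[n*(m+0.3)+G]/N with m and G in gigawords.
def total_memory_gb(n: int, m: float, G: float, N: int = 1) -> float:
    """n: total processes (-n), m: stack memory per process in gw (-m),
    G: total GA memory in gw (-G), N: number of nodes."""
    overhead = 0.3  # gw per process for the program itself
    return 8.0 * (n * (m + overhead) + G) / N

# Example: 20 processes with 1 gw stack each and 10 gw of GA on one node
# need about 288 GB, so such a job would not fit into a 256 GB machine.
print(total_memory_gb(n=20, m=1.0, G=10.0, N=1))
```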

For canonical MRCI or CCSD(T) calculations on one node no GA space is needed and -G does not need to be specified. On the other hand, for PNO-CCSD(T)-F12 calculations on extended molecules large GAs are needed, and a good rule of thumb is to divide the memory into equal parts for GA and stack memory. In order to facilitate this, the -M option is provided (in the following, its value is denoted $M$). With this, the total memory allocatable by Molpro can be specified. In density fitting (DF) and PNO calculations the memory is split into equal parts for stack and GA; other calculations use 80% for stack and 20% for GA. Thus, unless specified otherwise, in DF/PNO calculations the stack memory per process is $m=M\cdot N/(2\cdot n)$ and the total GA memory is $G=N\cdot M/2$.
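The equal DF/PNO split described above can be written out as a minimal sketch (covering only the 50/50 case; other calculations use the 80/20 split mentioned in the text):

```python
# Illustrative sketch: equal stack/GA split of -M for DF/PNO calculations.
# M is the total allocatable memory per node (-M, in gw),
# N the number of nodes, n the total number of processes.
def df_pno_split(M: float, N: int, n: int):
    """Return (m, G): stack memory per process and total GA memory,
    both in gigawords, for the 50/50 DF/PNO split."""
    m = M * N / (2 * n)  # stack memory per process
    G = N * M / 2        # total GA memory
    return m, G

# Example: -M=25g on one 20-core node gives m = 0.625 gw per process
# and G = 12.5 gw, so that n*m + G = 25 gw = M*N.
m, G = df_pno_split(M=25.0, N=1, n=20)
print(m, G)  # -> 0.625 12.5
```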

It is recommended to provide a default -M value in .molprorc, e.g. -M=25g for a dedicated machine with 256 GB of memory and 20 cores (.molprorc can be in the home directory and/or in the submission directory, the latter having preference). Then each Molpro run would be able to use the whole memory of the machine with a reasonable splitting between stack and GA. The default can be overridden or modified by the molpro command line options -m and/or -G, or by input options (cf. section memory allocation), the latter having preference over command line options.

If the -G or -M options are given, some programs check at early stages whether the GA space is sufficient. If not, an error exit occurs and the estimated amount of required GA space is printed. In this case the calculation should be repeated, specifying (at least) the printed amount of GA space with the -G option. If crashes occur without such a message, the calculation should also be repeated with more GA space, but in any case care should be taken that the total memory per node does not get too large.

The behavior of various option combinations is as follows:

• -M As described above.
• -M and -m The specified amount $m$ is allocated for each core, and the remaining memory is used for GA.
• -M and -G The specified amount $G$ is allocated for GA, and the remaining amount is split equally for stack memory of each process.
• -M and -G and -m The specified amounts of $m$ and $G$ are allocated, and the $M$ value is ignored.
• -G and -m The specified amounts of $m$ and $G$ are allocated.
• -m The specified amount of $m$ is allocated. An infinite amount of GA space is assumed, but nothing is preallocated or checked.
• -G The specified amount of $G$ is allocated, and the same total amount for stack memory (i.e. $M=2G$).
• nothing Same as -m32m.
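The combination rules above can be summarised in a small sketch (a hypothetical helper, not Molpro code; it assumes the equal DF/PNO split of -M, and returns None for G when GA space is assumed unlimited and unchecked):

```python
# Illustrative sketch of the option-combination rules (not Molpro code).
# All quantities in gigawords; M and G are per-node totals, m is per process.
# Assumes the equal DF/PNO split of -M; G=None means "unlimited, unchecked".
def resolve_memory(n, N, M=None, m=None, G=None):
    """Return (m, G) following the combination rules listed above."""
    if m is not None and G is not None:
        return m, G                        # -m and -G (any -M is ignored)
    if M is not None and m is not None:
        return m, N * M - n * m            # remaining memory goes to GA
    if M is not None and G is not None:
        return (N * M - G) / n, G          # remainder split over processes
    if M is not None:
        return M * N / (2 * n), N * M / 2  # equal DF/PNO split
    if G is not None:
        return G / n, G                    # same total for stack, M = 2G
    if m is not None:
        return m, None                     # GA assumed unlimited, unchecked
    return 0.032, None                     # default: -m32m (0.032 gw)

# Example: -M=25g on one 20-core node -> m = 0.625 gw, G = 12.5 gw.
print(resolve_memory(n=20, N=1, M=25.0))  # -> (0.625, 12.5)
```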

If the -G or -M option is present, the GA space is preallocated, except with GA/mpi-pr. If neither -G nor -M is given, no preallocation and no checks of GA space are performed.