running_molpro_on_parallel_computers [2021/06/02 08:55] (current)
qianli
Molpro will run on distributed-memory multiprocessor systems, including workstation clusters, under the control of the [[http://www.emsl.pnl.gov/docs/global|Global Arrays]] parallel toolkit or the MPI-2 library. There are also some parts of the code that can take advantage of shared memory parallelism through the [[http://www.openmp.org|OpenMP]] protocol, although these are somewhat limited, and this facility is not at present recommended. It should be noted that there remain some parts of the code that are not, or only partly, parallelized, and therefore run with replicated work. Additionally, some of those parts which have been parallelized rely on fast inter-node communications, and can be very inefficient across ordinary networks. Therefore some caution and experimentation is needed to avoid waste of resources in a multiuser environment.
  
Molpro effects interprocess cooperation through MPI and the //ppidd// library, which, depending on how it was configured and built, draws on either the [[http://www.emsl.pnl.gov/docs/global|GlobalArrays]] (GA) parallel toolkit or pure MPI. //ppidd// is described in [[http://dx.doi.org/10.1016/j.cpc.2009.05.002|Comp. Phys. Commun. 180, 2673-2679 (2009)]].
The use of GlobalArrays (GAs) is recommended for performance reasons.
There are different GA implementation options (runtimes), each with its own advantages and disadvantages (see [[GA Installation]]).

**Since Molpro 2021.2 the [[#disk option]] is used by default** in single-node calculations, in which case large data structures are simply kept in MPI files.
The behavior of previous versions can be recovered with the ''%%--ga-impl ga%%'' command line option.
However, ''%%--ga-impl ga%%'' requires pre-allocation of GA memory in many calculations if the ''sockets'' GA runtime is used, and failing to preallocate a sufficient amount of GA memory may lead to crashes or incorrect results.
Preallocating GA memory is not required with the ''mpi-pr'' runtime of GA or with the disk option.
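For illustration, a single-node run with Molpro 2021.2 or later, and the same run forced back to GlobalArrays storage, might look like this (the input file name, core count, and GA size are hypothetical; adjust them to your machine):

```shell
# Hypothetical input file and core count.
# Molpro 2021.2+, single node: large data structures go to MPI files by default.
molpro -n 16 h2o_pno_lccsd.inp

# Recover the pre-2021.2 behavior (GlobalArrays storage); with the sockets
# GA runtime, sufficient GA memory must then be preallocated, e.g. with -G.
molpro -n 16 --ga-impl ga -G16g h2o_pno_lccsd.inp
```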
  
===== GA Installation notes  =====
  * **''-n'' $|$ ''%%--tasks%%'' //tasks/tasks_per_node:smp_threads//** //tasks// specifies the number of parallel processes to be set up, and defaults to 1. //tasks_per_node// sets the number of GA (or MPI-2) processes to run on each node, where appropriate. The default is installation dependent. In some environments (e.g., IBM running under Loadleveler; PBS batch job), the value given by ''-n'' is capped to the maximum allowed by the environment; in such circumstances it can be useful to give a very large number as the value for ''-n'' so that the number of processes is controlled by the batch job specification. //smp_threads// relates to the use of OpenMP shared-memory parallelism, and specifies the maximum number of OpenMP threads that will be opened; it defaults to 1. Any of these three components may be omitted, and appropriate combinations will allow GA (or MPI-2)-only, OpenMP-only, or mixed parallelism.
  * **''-N'' $|$ ''%%--task-specification%%'' //user1:node1:tasks1,user2:node2:tasks2$\dots$//** //node1, node2// etc. specify the host names of the nodes on which to run. On most parallel systems, node1 defaults to the local host name, and there is no default for node2 and higher. On Cray T3E and IBM SP systems, and on systems running under the PBS batch system, if -N is not specified, nodes are obtained from the system in the standard way. //tasks1, tasks2// etc. may be used to control the number of tasks on each node as a more flexible alternative to ''-n'' / //tasks_per_node//. If omitted, they are each set equal to ''-n'' / //tasks_per_node//. //user1, user2// etc. give the username under which processes are to be created. Most of these parameters may be omitted in favour of the usually sensible default values.
  * **''-t'' $|$ ''%%--omp-num-threads%%'' //n//** Specify the number of OpenMP threads, as if the environment variable ''OMP_NUM_THREADS'' were set to //n//.
  * **''%%--ga-impl%%'' //method//** specifies the method by which large data structures are held in parallel. Available options are ''GA'' (GlobalArrays, default) or ''disk'' (MPI files, see [[#disk option]]). This option is most relevant for the more recent programs such as Hartree-Fock, DFT, MCSCF/CASSCF, and the PNO programs.
  * **''-D'' $|$ ''%%--global-scratch%%'' //directory//** specifies a scratch directory for the program that is accessible by all processors in multi-node calculations. This only affects parallel calculations with the [[running Molpro on parallel computers#disk option|disk option]].
  * **''%%--all-outputs%%''** produces an output file for each process when running in parallel.
  
===== Memory specifications  =====
  
Large-scale parallel Molpro calculations may involve a significant amount of global data structures.
This concerns in particular [[local correlation methods with pair natural orbitals (PNOs)|PNO-LCCSD]] calculations, and, to a lesser extent, also Hartree-Fock, DFT, and MCSCF/CASSCF calculations.
For these calculations, it may be necessary to share the available memory of the machine between the Molpro "stack" memory (determined by the ''-m'' command line option or the ''memory'' card in the input file) and the GA memory (determined by the ''-G'' command line option):
  
  - If the [[#disk option]] is disabled (the default in multi-node calculations) and one of the older GA runtimes (''sockets'', ''openib'', etc.) is used (**including when using the Molpro binary release**): a sufficient amount of GA memory must be specified by the ''-G'' or ''-M'' option (see below) and pre-allocated by Molpro at the beginning of a calculation; otherwise the calculation may crash or yield incorrect results.
  - If the disk option is disabled and one of the comex-based GA runtimes (e.g. ''mpi-pr'') is used, or if the disk option is enabled but the scratch is in a tmpfs: the ''-G'' or ''-M'' option is not mandatory, but sufficient physical memory must be left for the global data structures.
  - If the [[#disk option]] is enabled and the scratch directory is located on a physical disk: the GA usage should be negligible and the **''-G'' or ''-M'' options should not be given**. However, the performance of the calculation might be better if some memory is left for the system to buffer the I/O.
  
Note that the disk option has been made the default in single-node calculations since Molpro 2021.2. If this causes performance problems, the previous behavior of storing large data structures in GlobalArrays can be enabled by setting the environment variable ''MOLPRO_GA_IMPL'' to ''GA'', or by passing the ''%%--ga-impl ga%%'' command-line option.
  
Both the ''-m'' and ''-G'' options are by default given in megawords (m), but the unit gigaword (g) can also be used (e.g. ''-m1000'' is equivalent to ''-m1000m'' and to ''-m1g'').
The total memory $M$ per node allocated by Molpro amounts to $(n \cdot m+G)/N$, where $n$ is the total number of processes (''-n'' option), $m$ is the stack memory per process (''-m'' option), $G$ the GA memory (''-G'' option), and $N$ the number of nodes.
In addition, at least 200 MW per process should be added for the program itself.
In total, a calculation needs about $8\cdot[n\cdot(m+0.3)+G]/N$ GB (gigabytes) of memory ($n,m,G$ in GW; the $0.3$ GW term is a safe allowance for the per-process program overhead), and this should not exceed the physical memory of the machine(s) used.
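As a worked example of these formulas, consider a hypothetical job of 32 processes spread over 2 nodes, with ''-m500'' (i.e. $m=0.5$ GW per process) and ''-G16g'' ($G=16$ GW). The per-node totals can be checked with a quick shell calculation:

```shell
# Hypothetical job size: n processes over N nodes, m GW of stack each, G GW of GA.
n=32; m=0.5; G=16; N=2

# Memory allocated by Molpro per node, in GW: (n*m + G)/N
awk -v n="$n" -v m="$m" -v G="$G" -v N="$N" \
    'BEGIN { printf "allocated: %.1f GW per node\n", (n*m + G)/N }'

# Rough total requirement per node, in GB: 8*[n*(m+0.3) + G]/N
awk -v n="$n" -v m="$m" -v G="$G" -v N="$N" \
    'BEGIN { printf "required:  %.1f GB per node\n", 8*(n*(m+0.3) + G)/N }'
```

Here each node allocates 16 GW and needs roughly 166 GB of physical memory, so two 128 GB nodes would be too small for this job.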
Note that in many calculations, leaving some memory free for the system to buffer I/O operations may improve performance significantly.

A proper ratio for the splitting between "stack" and GA memory depends on the calculation and the system.
Some experimentation is usually required to obtain optimal performance.
As a rule of thumb, the following choices can be used as a starting point:

  * **Density fitting and PNO calculations**: divide the memory into equal parts for GA and stack memory.
  * **HF, DFT, and MCSCF calculations**: 75% for the stack and 25% for GA.
  * **Canonical MRCI or CCSD(T) calculations** on one node: no GA space is needed.

In order to facilitate the memory splitting, the ''-M'' option is provided (in the following, its value is denoted $M$).
With this, the total memory allocatable by Molpro can be specified, and the memory is split 50-50 between stack and GA in DF/PNO calculations, and 80-20 in other calculations.
Thus, unless specified otherwise, in DF/PNO calculations the stack memory per process is $m=M\cdot N/(2\cdot n)$ and the total GA memory is $G=N\cdot M/2$.
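For instance, with the (hypothetical) setting ''-M25g'' on a single 20-core node ($N=1$, $n=20$), the DF/PNO split works out as follows:

```shell
# Hypothetical single-node example: -M25g, N=1 node, n=20 processes.
M=25; N=1; n=20

# DF/PNO 50-50 split: stack per process m = M*N/(2*n), total GA G = N*M/2 (in GW)
awk -v M="$M" -v N="$N" -v n="$n" \
    'BEGIN { printf "m = %.3f GW per process, G = %.1f GW total\n", M*N/(2*n), N*M/2 }'
```

i.e. each process gets 0.625 GW of stack memory and 12.5 GW is reserved for GA.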
If the use of GA for storing large data structures is desired, it is recommended to provide a default ''-M'' value in .molprorc (**do not do so for disk-based calculations, see [[#disk option]]**), e.g. ''-M=25g'' for a dedicated machine with 256 GB of memory and 20 cores (.molprorc can be in the home directory and/or in the submission directory, the latter having preference). Then each Molpro run would be able to use the whole memory of the machine with reasonable splitting between stack and GA. The default can be overwritten or modified by the molpro command line options ''-m'' and/or ''-G'', or by input options (cf. section [[general program structure#memory allocation|memory allocation]]), the latter having preference over command line options.

If the ''-G'' or ''-M'' options are given, some programs check at early stages whether the GA space is sufficient. If not, an error exit occurs and the estimated amount of required GA is printed. In this case the calculation should be repeated, specifying (at least) the printed amount of GA space with the ''-G'' option. If crashes without such a message occur, the calculation should also be repeated with more GA space or with the disk option, but care should be taken that the total memory per node does not get too large.
  
The behavior of various option combinations is as follows:
  
  * **''-M''** As described above.
  * **''-M'' and ''-m''** The specified amount $m$ is allocated for each core, and the remaining memory for GA.
  * **''-M'' and ''-G''** The specified amount $G$ is allocated for GA, and the remaining amount is split equally for stack memory of each process.
  * **''-M'' and ''-G'' and ''-m''** The specified amounts of $m$ and $G$ are allocated, and the $M$ value is ignored.
  * **''nothing''** Same as -m32m.
  
If ''-G'' or ''-M'' is present, the GA space is preallocated unless GA is using helper processes (i.e., a comex-based runtime such as ''mpi-pr'' is used, in which case preallocation of GA is not necessary).
If neither ''-G'' nor ''-M'' is given, no preallocation and no checks of GA space are performed.
  
===== Disk option =====
  
Since version 2021.1, Molpro can use MPI files instead of GlobalArrays to store large global data. This option can be enabled globally by setting the environment variable ''MOLPRO_GA_IMPL'' to ''DISK'', or by passing the ''%%--ga-impl disk%%'' command-line option.
Since version 2021.2 the disk option has been made the default in single-node calculations.
Some programs in Molpro, including DF-HF, DF-KS, (DF-)MULTI, DF-TDDFT, and PNO-LCCSD, also support an input option ''implementation=disk'' to enable the disk option for the particular job step.
The file system for these MPI files must be accessible by all processors.
By default the Molpro scratch directory is used, but another directory can be chosen for the MPI files using the ''-D'' command line option or the ''MOLPRO_GLOBAL_SCRATCH'' environment variable.
The directory can be a tmpfs (e.g., ''-D /dev/shm'') in single-node calculations, in which case the GAs / MPI files are kept in shared memory.
  
With the disk option the problems associated with GA pre-allocation are avoided. In this case,
use only ''-m'' or the ''memory'' card to specify the Molpro stack memory for each processor. To avoid GA preallocation, **do not provide ''-M'' or ''-G''**.
Please also make sure that ''-M'' and ''-G'' are not present in ''.molprorc'', etc.
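For example, a single-node disk-option run that keeps the global data in shared memory might be launched as follows (the input file name and sizes are hypothetical; note that only ''-m'' is given, not ''-M'' or ''-G''):

```shell
# Hypothetical disk-option run on a 16-core node: global data in /dev/shm (tmpfs),
# 1 GW of stack memory per process, and no -M/-G so that no GA is preallocated.
molpro -n 16 -m1g -D /dev/shm benzene_df_hf.inp
```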
  
The performance of the disk option varies depending on the I/O capacity, the available system memory, the MPI software, and the nature of the calculation.
Usually, the best practice is to reserve some system memory for buffering I/O operations (i.e., do not allocate all available memory to Molpro with ''-m'' or the ''memory'' input card).
When this is done, the performance of single-node disk-based calculations can be comparable to that of GA-based ones in many cases, in particular with SSDs.
  
===== Embarrassing parallel computation of gradients or Hessians (mppx mode)  =====
  
The numerical computation of gradients or Hessians, or the automatic generation of potential energy surfaces, requires many similar calculations at different (displaced) geometries. An automatic parallel computation of the energy and/or gradients at different geometries is implemented for the gradient, hessian, and surf programs. In this so-called mppx-mode, each processing core runs an independent calculation in serial mode. This happens automatically using the ''-n'' available cores. The automatic mppx processing can be switched off by setting option ''mppx=0'' on the ''OPTG'', ''FREQ'', or ''HESSIAN'' command lines. In this case, the program will process each displacement in the standard parallel mode.

===== Options for developers =====

==== Debugging options ====

  * **''%%--ga-debug%%''** activates GA debugging statements.
  * **''%%--check-collective%%''** checks collective operations when debugging.

==== Options for pure MPI-based PPIDD build ====

This section is **not** applicable if the Molpro binary release is used, or when Molpro is built using the GlobalArrays toolkit (which we recommend).

In the case of the pure MPI implementation of PPIDD, there is a choice of using either MPI-2 one-sided memory access, or devoting some of the processes to act as data "helpers". It is generally found that performance is significantly better if at least one dedicated helper is used, and in some cases it is advisable to specify more. The scalable limit is to devote one core on each node in a typical multi-core cluster machine, but in most cases it is possible to manage with fewer, thereby making more cores available for computation. This aspect of the configuration can be tuned through the following options:

  * **''%%--multiple-helper-server%%'' //nprocs_per_server//** enables multiple helper servers, and //nprocs_per_server// sets how many processes share one helper server. For example, when the total number of processes is $32$ and $nprocs\_per\_server=8$, every $8$ processes (including the helper server) share one helper server, and there are $4$ helper servers in total. Any unreasonable value of $nprocs\_per\_server$ (i.e., any integer less than 2) is reset to a very large number automatically, which is equivalent to option ''%%--single-helper-server%%''.
  * **''%%--node-helper-server%%''** specifies one helper server on every node if all the nodes are symmetric and have a reasonable number of processes (i.e., every node has the same number of processes, and that number is greater than 1); this is the default behaviour. Otherwise, only one single helper server for all processes/nodes will be used, which is equivalent to option ''%%--single-helper-server%%''.
  * **''%%--single-helper-server%%''** specifies only one single helper server for all processes.
  * **''%%--no-helper-server%%''** disables the helper server.

When one or more helper servers are enabled, the corresponding processes act as data helper servers, and the remaining processes are used for computation. Even so, performance is quite competitive when running with a large number of processes. When the helper server is disabled, all processes are used for computation; however, the performance may not be good because of the poor performance of some existing implementations of the MPI-2 standard for one-sided operations.
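As a sketch (the node and core counts are hypothetical), a pure-MPI PPIDD run on four symmetric 8-core nodes with one helper per node could be launched as:

```shell
# Hypothetical pure-MPI PPIDD build: 32 processes on 4 symmetric 8-core nodes.
# --node-helper-server (the default) devotes one process per node to serving data,
# leaving 7 compute processes per node.
molpro -n 32 --node-helper-server big_casscf.inp
```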