[molpro-user] problems with global file system when running in parallel

Manhui Wang wangm9 at cardiff.ac.uk
Tue Feb 5 10:55:37 GMT 2013


Hi Jörg,

Molpro itself gives the temporary files different names for different 
processes, so possible I/O conflicts are avoided.
When using a globally shared scratch space, a more efficient approach is to 
set different scratch directories for different nodes. For details, please 
refer to the previous discussion:
http://www.molpro.net/pipermail/molpro-user/2011-March/004229.html
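
As a rough sketch only (the wrapper below, the path /global/scratch and the 
way it is hooked into the parallel launch are illustrative assumptions, not 
something Molpro itself provides), a node-specific subdirectory on the shared 
file system can be derived from the hostname and passed to Molpro with the 
-d option:

#!/usr/bin/env python3
# Illustrative helper: create a scratch subdirectory named after the node it
# runs on, so that two nodes writing to the same shared file system never use
# the same directory.  How this is attached to the parallel launch (batch
# prologue, per-node wrapper, ...) depends on the local setup.
import os
import socket
import subprocess
import sys

SHARED_SCRATCH = "/global/scratch"              # hypothetical shared mount point
node_dir = os.path.join(SHARED_SCRATCH,
                        os.environ.get("USER", "nouser"),
                        socket.gethostname())    # one subdirectory per node
os.makedirs(node_dir, exist_ok=True)

# Hand the node-specific directory to Molpro via -d and forward the remaining
# command-line arguments (input file, -n, ...) unchanged.
subprocess.run(["molpro", "-d", node_dir] + sys.argv[1:], check=True)

Something along these lines, adapted to the local batch system, keeps the I/O 
of different nodes separated even though the underlying file system is shared.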

In most cases Molpro achieves better performance with local scratch, whereas 
with a globally shared scratch the I/O speed can become the bottleneck when 
hundreds of users access large amounts of data on the shared file system.

Best wishes,
Manhui

On 04/02/13 23:51, Jörg Saßmannshausen wrote:
> Hi Jeff,
>
> thanks for the feedback.
>
> What I cannot really work out is this: on my 8-core machine everything works, and
> there I only have one (local) scratch space.
> Thus, I would have thought that this is not a problem.
>
> I can see where you are coming from; however, I would not know how to set up
> different scratch directories for the different nodes the job is running on. The only
> option I found in the Molpro manual regarding scratch space is the -d flag.
> There you give a full path for the scratch space, and hence I would not know how
> to say that core1 uses space1, etc.
>
> I thought of Sys5 shm as well, but as I have already increased it in order to run
> NWChem on that machine, and as the job runs fine with local scratch, I would have
> thought there is no problem here.
>
> I am still a bit puzzled here.
>
> All the best from London
>
> Jörg
>
>
> On Monday 04 February 2013 Jeff Hammond wrote:
>> If you want shared scratch to behave as if it were local scratch, just
>> create a subdirectory for each process to ensure that there are no I/O
>> conflicts.  NWChem does this automagically with *.${procid} file
>> suffixes, but it is easy enough to use a directory instead, since that
>> requires no source changes.
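>>
>> For example, very roughly (the directory layout and the rank variable are
>> only illustrative; OMPI_COMM_WORLD_RANK is what Open MPI sets, other
>> launchers use other names):
>>
>> import os
>>
>> # Each process changes into its own subdirectory of the shared scratch, so
>> # no two processes ever touch the same scratch file.  The rank is taken from
>> # the MPI launcher's environment if present, otherwise the pid is used.
>> SHARED_SCRATCH = "/global/scratch/myjob"       # assumed shared location
>> rank = os.environ.get("OMPI_COMM_WORLD_RANK", str(os.getpid()))
>> proc_dir = os.path.join(SHARED_SCRATCH, "proc." + rank)
>> os.makedirs(proc_dir, exist_ok=True)
>> os.chdir(proc_dir)                             # all scratch I/O stays in here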
>>
>> Molpro might have an option for this but I don't know what it is.
>>
>> Note also that I cannot be certain that the error messages you see
>> aren't a side-effect of Sys5 shm exhaustion, which has nothing to do
>> with file I/O, but since you say this job runs fine on local scratch,
>> I'll assume that Sys5 is not the issue.  ARMCI error messages are not
>> always as they seem.
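>>
>> If you want to rule that in or out on a Linux node, the SysV limits can be
>> read directly from /proc (illustrative snippet, not Molpro-specific):
>>
>> # Print the kernel's SysV shared-memory limits (Linux only).
>> for name in ("shmmax", "shmall", "shmmni"):
>>     with open("/proc/sys/kernel/" + name) as f:
>>         print(name, f.read().strip())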
>>
>> Jeff
>>
>> On Mon, Feb 4, 2013 at 6:42 AM, Jörg Saßmannshausen
>> <j.sassmannshausen at ucl.ac.uk> wrote:
>>> Dear all,
>>>
>>> I was wondering if somebody could shed some light on here.
>>>
>>> When I am trying to do a DF-LCCSD(T) calculation, the first few steps are
>>> working fine, but then the program crashes when it gets to this point:
>>>   MP2 energy of close pairs:            -0.09170948
>>>   MP2 energy of weak pairs:             -0.06901764
>>>   MP2 energy of distant pairs:          -0.00191297
>>>
>>>   MP2 correlation energy:               -2.48344057
>>>   MP2 total energy:                   -940.89652776
>>>
>>>   LMP2 singlet pair energy              -1.53042229
>>>   LMP2 triplet pair energy              -0.95301828
>>>
>>>   SCS-LMP2 correlation energy:          -2.42949590   (PS=  1.200000  PT=  0.333333)
>>>   SCS-LMP2 total energy:              -940.84258309
>>>
>>>   Minimum Memory for K-operators:     2.48 MW   Maximum memory for K-operators    28.97 MW  used:    28.97 MW
>>>
>>>   Memory for amplitude vector:        0.52 MW
>>>
>>>   Minimum memory for LCCSD:     8.15 MW, used:     65.01 MW, max:     64.48 MW
>>>
>>>   ITER.      SQ.NORM     CORR.ENERGY   TOTAL ENERGY   ENERGY CHANGE      DEN1     VAR(S)    VAR(P)  DIIS     TIME
>>>     1      1.96000293    -2.52977250  -940.94285970    -0.04633193   -2.42872569  0.35D-01  0.15D-01  1  1   348.20
>>>
>>>
>>> Here are the error messages which I found:
>>>
>>> 5:Segmentation Violation error, status=: 11
>>> (rank:5 hostname:node32 pid:5885):ARMCI DASSERT fail.
>>> src/common/signaltrap.c:SigSegvHandler():310 cond:0
>>>
>>>    5: ARMCI aborting 11 (0xb).
>>>
>>> tmp = /home/sassy/pdir//usr/local/molpro-2012.1/bin/molpro.exe.p
>>>
>>>   Creating: host=node33, user=sassy,
>>>
>>> [ ... ]
>>>
>>> and
>>>
>>> Last System Error Message from Task 5:: Bad file descriptor
>>>
>>>    5: ARMCI aborting 11 (0xb).
>>>
>>> system error message: Invalid argument
>>>
>>>   24: interrupt(1)
>>>
>>> Last System Error Message from Task 2:: Bad file descriptor
>>> Last System Error Message from Task 0:: Inappropriate ioctl for device
>>>
>>>    2: ARMCI aborting 2 (0x2).
>>>
>>> system error message: Invalid argument
>>> Last System Error Message from Task 3:: Bad file descriptor
>>>
>>>    3: ARMCI aborting 2 (0x2).
>>>
>>> system error message: Invalid argument
>>> WaitAll: Child (25216) finished, status=0x8200 (exited with code 130).
>>> [ ... ]
>>>
>>> I have the feeling there is a problem with reading/writing some files.
>>> The global file system has around 158 GB of disc space free and, as far as I
>>> could see, it was not full at the time of the run.
>>>
>>> Interestingly, the same input file works when the local scratch space is
>>> used. As the local scratch is rather small, I would prefer to use the larger
>>> global file system.
>>>
>>> Are there any known problems with that approach or is there something I
>>> am doing wrong here?
>>>
>>> All the best from a sunny London
>>>
>>> Jörg
>>>
>>> --
>>> *************************************************************
>>> Jörg Saßmannshausen
>>> University College London
>>> Department of Chemistry
>>> Gordon Street
>>> London
>>> WC1H 0AJ
>>>
>>> email: j.sassmannshausen at ucl.ac.uk
>>> web: http://sassy.formativ.net
>>>
>>> Please avoid sending me Word or PowerPoint attachments.
>>> See http://www.gnu.org/philosophy/no-word-attachments.html
>>>
>

-- 
-----------
Manhui  Wang
School of Chemistry, Cardiff University,
Main Building, Park Place,
Cardiff CF10 3AT, UK
