[molpro-user] problems with global file system when running in parallel

Jörg Saßmannshausen j.sassmannshausen at ucl.ac.uk
Tue Feb 5 12:13:18 GMT 2013


Hi Manhui,

thanks, that was exactly what I was looking for. 

I will try it out at the weekend when the cluster is less busy and report back 
so others who might have similar problems can have a look as well.

In my case, as I am more or less the only one using the larger global file 
system, it should be OK. I agree it is not ideal, as scratch should be local; 
however, in some circumstances where you simply need more scratch space it is 
the only option I have on that particular cluster.

All the best from a sunny but cold London

Jörg


On Tuesday 05 February 2013 10:55:37 Manhui Wang wrote:
> Hi Jörg,
> 
> Molpro itself handles the temporary files by giving them different names
> for different processes, to avoid possible I/O conflicts.
> When using globally shared scratch space, a more efficient way is to set
> different scratch directories for different nodes. About this, please
> refer to the previous discussion:
> http://www.molpro.net/pipermail/molpro-user/2011-March/004229.html
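
A rough sketch of the job-script side of that setup (assuming a PBS-style
batch system, so $PBS_NODEFILE and $PBS_JOBID are available; the scratch
root /global/scratch is a placeholder, not a Molpro default):

  #!/bin/sh
  # Prepare one private scratch directory per node of the job on the
  # shared file system, so the nodes never write into the same place.
  SCRATCH_ROOT=/global/scratch/$USER/$PBS_JOBID
  for node in $(sort -u "$PBS_NODEFILE"); do
      mkdir -p "$SCRATCH_ROOT/$node"
  done

Molpro's -d option would then be pointed at the node-specific path; the
thread linked above discusses how to set that up on the Molpro side.
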
> 
> In most cases, Molpro achieves better performance with local scratch,
> whereas with globally shared scratch the I/O speed can become the
> bottleneck when hundreds of users access large amounts of data on the
> shared file system.
> 
> Best wishes,
> Manhui
> 
> On 04/02/13 23:51, Jörg Saßmannshausen wrote:
> > Hi Jeff,
> > 
> > thanks for the feedback.
> > 
> > What I cannot really work out is this: on my 8-core machine it is all
> > working, and there I only have one (local) scratch space.
> > Thus, I would have thought that this is not a problem.
> > 
> > I can see where you are coming from; however, I would not know how to
> > set up different scratch directories for the different nodes the job is
> > running on. The only option I found in the Molpro manual regarding
> > scratch space is the -d flag. There you give a full path for the scratch
> > space, and hence I would not know how to say that core1 uses space1, etc.
> > 
> > I thought of Sys5 shm as well, but as I have already set it higher to use
> > NWChem on that machine, and as it runs with local scratch, I would have
> > thought there is no problem here.
> > 
> > I am still a bit puzzled here.
> > 
> > All the best from London
> > 
> > Jörg
> > 
> > On Monday 04 February 2013 Jeff Hammond wrote:
> >> If you want shared scratch to behave as if it were local scratch, just
> >> create a subdirectory for each process to ensure that no I/O conflicts
> >> occur.  NWChem does this automagically with *.${procid} file
> >> suffixes, but it's easy enough to use a directory instead, since that
> >> requires no source changes.
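
In other words, a thin wrapper of roughly this shape (a generic sketch, not
tied to Molpro's launcher; /global/scratch is a placeholder, and TMPDIR only
helps for codes that take their scratch location from it):

  #!/bin/sh
  # Give each process its own subdirectory on the shared file system, so
  # no two processes ever touch the same temporary file.  Host name plus
  # PID keeps the name unique, in the same spirit as NWChem's *.${procid}
  # suffixes, but without any source changes.
  SCR=/global/scratch/$USER/$(hostname -s).$$
  mkdir -p "$SCR"
  export TMPDIR="$SCR"
  exec "$@"          # start the real program with its private scratch

The parallel launcher would start this wrapper instead of the program
itself, passing the real binary and its arguments through.
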
> >> 
> >> Molpro might have an option for this but I don't know what it is.
> >> 
> >> Note also that I cannot be certain that the error messages you see
> >> aren't a side-effect of Sys5 shm exhaustion, which has nothing to do
> >> with file I/O, but since you say this job runs fine on local scratch,
> >> I'll assume that Sys5 is not the issue.  ARMCI error messages are not
> >> always as they seem.
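
For reference, the current SysV shared-memory limits on a node can be
checked directly with standard Linux tools, e.g.:

  # show shared-memory limits and current usage
  ipcs -lm
  # kernel limits for maximum segment size and total pages
  sysctl kernel.shmmax kernel.shmall
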
> >> 
> >> Jeff
> >> 
> >> On Mon, Feb 4, 2013 at 6:42 AM, Jörg Saßmannshausen <j.sassmannshausen at ucl.ac.uk> wrote:
> >>> Dear all,
> >>> 
> >>> I was wondering if somebody could shed some light on this.
> >>> 
> >>> When I try to do a DF-LCCSD(T) calculation, the first few steps work
> >>> OK, but then the program crashes when it gets to this point:
> >>>   MP2 energy of close pairs:            -0.09170948
> >>>   MP2 energy of weak pairs:             -0.06901764
> >>>   MP2 energy of distant pairs:          -0.00191297
> >>>
> >>>   MP2 correlation energy:               -2.48344057
> >>>   MP2 total energy:                   -940.89652776
> >>>
> >>>   LMP2 singlet pair energy              -1.53042229
> >>>   LMP2 triplet pair energy              -0.95301828
> >>>
> >>>   SCS-LMP2 correlation energy:          -2.42949590   (PS=  1.200000  PT=  0.333333)
> >>>   SCS-LMP2 total energy:              -940.84258309
> >>>
> >>>   Minimum Memory for K-operators:     2.48 MW  Maximum memory for K-operators    28.97 MW  used:    28.97 MW
> >>>
> >>>   Memory for amplitude vector:        0.52 MW
> >>>
> >>>   Minimum memory for LCCSD:     8.15 MW, used:     65.01 MW, max:     64.48 MW
> >>>
> >>>   ITER.      SQ.NORM     CORR.ENERGY   TOTAL ENERGY   ENERGY CHANGE      DEN1    VAR(S)    VAR(P)  DIIS     TIME
> >>>     1      1.96000293    -2.52977250  -940.94285970    -0.04633193   -2.42872569  0.35D-01  0.15D-01  1  1   348.20
> >>> 
> >>> 
> >>> Here are the error messages which I found:
> >>> 
> >>> 5:Segmentation Violation error, status=: 11
> >>> (rank:5 hostname:node32 pid:5885):ARMCI DASSERT fail.
> >>> src/common/signaltrap.c:SigSegvHandler():310 cond:0
> >>> 
> >>>    5: ARMCI aborting 11 (0xb).
> >>> 
> >>> tmp = /home/sassy/pdir//usr/local/molpro-2012.1/bin/molpro.exe.p
> >>> 
> >>>   Creating: host=node33, user=sassy,
> >>> 
> >>> [ ... ]
> >>> 
> >>> and
> >>> 
> >>> Last System Error Message from Task 5:: Bad file descriptor
> >>> 
> >>>    5: ARMCI aborting 11 (0xb).
> >>> 
> >>> system error message: Invalid argument
> >>> 
> >>>   24: interrupt(1)
> >>> 
> >>> Last System Error Message from Task 2:: Bad file descriptor
> >>> Last System Error Message from Task 0:: Inappropriate ioctl for device
> >>> 
> >>>    2: ARMCI aborting 2 (0x2).
> >>> 
> >>> system error message: Invalid argument
> >>> Last System Error Message from Task 3:: Bad file descriptor
> >>> 
> >>>    3: ARMCI aborting 2 (0x2).
> >>> 
> >>> system error message: Invalid argument
> >>> WaitAll: Child (25216) finished, status=0x8200 (exited with code 130).
> >>> [ ... ]
> >>> 
> >>> I get the feeling there is a problem with reading/writing some files.
> >>> The global file system has around 158 GB of disc space free, and as far
> >>> as I could see it was not full at the time of the run.
> >>> 
> >>> Interestingly, the same input file works when run with the local scratch
> >>> space. As the local scratch is rather small, I would like to use the
> >>> larger global file system.
> >>> 
> >>> Are there any known problems with that approach or is there something I
> >>> am doing wrong here?
> >>> 
> >>> All the best from a sunny London
> >>> 
> >>> Jörg
> >>> 

-- 
*************************************************************
Jörg Saßmannshausen
University College London
Department of Chemistry
Gordon Street
London
WC1H 0AJ 

email: j.sassmannshausen at ucl.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html



