[molpro-user] problems with global file system when running in parallel

Jeff Hammond jhammond at alcf.anl.gov
Mon Feb 4 15:41:52 GMT 2013


If you want shared scratch to behave as if it were local scratch, just
create a subdirectory for each process so that no two processes ever
touch the same files.  NWChem does this automagically with *.${procid}
file suffixes, but it's easy enough to use a directory instead, since
that requires no source changes.
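
As a rough sketch of what I mean (the paths and the rank environment
variable are only placeholders; Open MPI exports OMPI_COMM_WORLD_RANK,
MPICH-style launchers export PMI_RANK), a per-process wrapper could do
something like this before starting the application:

    import os

    # Which rank am I?  The variable name depends on the MPI launcher:
    # OMPI_COMM_WORLD_RANK for Open MPI, PMI_RANK for MPICH-based stacks.
    rank = os.environ.get("OMPI_COMM_WORLD_RANK") or os.environ.get("PMI_RANK", "0")

    # Hypothetical shared scratch root on the global file system.
    shared_root = "/scratch/global/myjob"

    # Give this process its own subdirectory so no two ranks ever
    # open or delete the same scratch files.
    my_scratch = os.path.join(shared_root, "proc.%s" % rank)
    os.makedirs(my_scratch, exist_ok=True)

    # Point the application's scratch at the per-rank directory.
    os.environ["TMPDIR"] = my_scratch

The same idea takes a couple of lines of shell in the job script; the
point is only that every process sees a private directory on the shared
file system.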

Molpro might have an option for this but I don't know what it is.

Note also that I cannot be certain that the error messages you see
aren't a side effect of Sys5 shm exhaustion, which has nothing to do
with file I/O, but since you say this job runs fine on local scratch,
I'll assume that Sys5 is not the issue.  ARMCI error messages are not
always what they seem.
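
(If you do want to rule out the Sys5 side: running "ipcs -m" on the
compute node lists the shared-memory segments currently allocated, and
the kernel limits can be read from /proc.  A minimal sketch, assuming a
Linux node with the usual /proc interface:

    # Read the System V shared-memory limits from /proc (Linux only).
    def read_kernel_limit(name):
        with open("/proc/sys/kernel/%s" % name) as f:
            return int(f.read().strip())

    print("shmmax (max size of one segment, bytes):", read_kernel_limit("shmmax"))
    print("shmall (total SysV shm, in pages):      ", read_kernel_limit("shmall"))

If the job's aggregate ARMCI allocation runs up against those limits,
you can get failures that have nothing to do with the file system.)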

Jeff

On Mon, Feb 4, 2013 at 6:42 AM, Jörg Saßmannshausen
<j.sassmannshausen at ucl.ac.uk> wrote:
> Dear all,
>
> I was wondering if somebody could shed some light on this.
>
> When I try to run a DF-LCCSD(T) calculation, the first few steps work fine,
> but then the program crashes when it gets to this point:
>
>  MP2 energy of close pairs:            -0.09170948
>  MP2 energy of weak pairs:             -0.06901764
>  MP2 energy of distant pairs:          -0.00191297
>
>  MP2 correlation energy:               -2.48344057
>  MP2 total energy:                   -940.89652776
>
>  LMP2 singlet pair energy              -1.53042229
>  LMP2 triplet pair energy              -0.95301828
>
>  SCS-LMP2 correlation energy:          -2.42949590   (PS=  1.200000  PT=  0.333333)
>  SCS-LMP2 total energy:              -940.84258309
>
>  Minimum Memory for K-operators:     2.48 MW
>  Maximum memory for K-operators:    28.97 MW  used:    28.97 MW
>  Memory for amplitude vector:        0.52 MW
>
>  Minimum memory for LCCSD:     8.15 MW, used:     65.01 MW, max:     64.48 MW
>
>  ITER.      SQ.NORM     CORR.ENERGY   TOTAL ENERGY   ENERGY CHANGE        DEN1     VAR(S)    VAR(P)  DIIS     TIME
>    1      1.96000293    -2.52977250  -940.94285970    -0.04633193   -2.42872569  0.35D-01  0.15D-01  1  1   348.20
>
>
> Here are the error messages which I found:
>
> 5:Segmentation Violation error, status=: 11
> (rank:5 hostname:node32 pid:5885):ARMCI DASSERT fail.
> src/common/signaltrap.c:SigSegvHandler():310 cond:0
>   5: ARMCI aborting 11 (0xb).
> tmp = /home/sassy/pdir//usr/local/molpro-2012.1/bin/molpro.exe.p
>  Creating: host=node33, user=sassy,
> [ ... ]
>
> and
>
> Last System Error Message from Task 5:: Bad file descriptor
>   5: ARMCI aborting 11 (0xb).
> system error message: Invalid argument
>  24: interrupt(1)
> Last System Error Message from Task 2:: Bad file descriptor
> Last System Error Message from Task 0:: Inappropriate ioctl for device
>   2: ARMCI aborting 2 (0x2).
> system error message: Invalid argument
> Last System Error Message from Task 3:: Bad file descriptor
>   3: ARMCI aborting 2 (0x2).
> system error message: Invalid argument
> WaitAll: Child (25216) finished, status=0x8200 (exited with code 130).
> [ ... ]
>
> I get the feeling there is a problem with reading/writing some files.
> The global file system has around 158 GB of disc space free and, as far as I
> could see, it was not full at the time of the run.
>
> Interestingly, the same input file works fine when using the local scratch
> space.  As the local scratch is rather small, I would prefer to use the
> larger global file system.
>
> Are there any known problems with that approach, or is there something I am
> doing wrong here?
>
> All the best from a sunny London
>
> Jörg
>
> --
> *************************************************************
> Jörg Saßmannshausen
> University College London
> Department of Chemistry
> Gordon Street
> London
> WC1H 0AJ
>
> email: j.sassmannshausen at ucl.ac.uk
> web: http://sassy.formativ.net
>
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond


