Re: Errors running ADF using MPI on a cluster

From: Alexei Yakovlev <yakovlev_at_email.domain.hidden>
Date: Tue, 16 Aug 2005 10:09:26 +0200

Dear Dr. Huang,

These error messages usually indicate that environment variables are not
passed from mpirun to the ADF master node. Whether variables are passed
depends on the MPI implementation and can sometimes be influenced by
mpirun switches. In my experience, the mpirun command that comes with
MPICH (ch_p4) passes the environment only if the master node is spawned
directly by mpirun and not via ssh/rsh. The latter happens when one
uses -nolocal. As far as I know, ch_gm has a command-line option to
request passing of certain environment variables from mpirun to the
executed program. For example, to pass two variables, FOO_1 and FOO_2,
to a parallel program, use the following command form (from
http://www.myri.com/scs/READMES/README-mpich-gm):
mpirun FOO_1=<my_value_there> FOO_2=<another_value> -np 2 foo.x

For ADF, the most important environment variables are SCM_VECTORLENGTH,
SCM_WD, SCM_IOBUFFERSIZE, SCM_DIAG_EXCLUDE, SCM_DEBUG, SCM_NEWDIAG,
SCM_TMPDIR, SCM_USETMPDIR, SCM_NOMEMCHECK, SCMLICENSE, and SCM_STDIN.
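
As a rough sketch of how the two points above could be combined, the
following shell fragment collects whichever of these variables are set
and prints them as NAME=value words in the form quoted from the
MPICH-GM README. The function name scm_env_args is made up for this
illustration, foo.x and -np 2 are the README's placeholders rather than
a real ADF invocation, and the command is only echoed, not executed.

```shell
#!/bin/sh
# Variable names copied from the list above.
SCM_VARS="SCM_VECTORLENGTH SCM_WD SCM_IOBUFFERSIZE SCM_DIAG_EXCLUDE
SCM_DEBUG SCM_NEWDIAG SCM_TMPDIR SCM_USETMPDIR SCM_NOMEMCHECK
SCMLICENSE SCM_STDIN"

# Hypothetical helper: print one NAME=value word per line for every
# listed variable that is set and non-empty in the current shell.
scm_env_args() {
    for name in $SCM_VARS; do
        # Indirect lookup; safe here because names come from the fixed list.
        eval "value=\${$name-}"
        if [ -n "$value" ]; then
            printf '%s\n' "$name=$value"
        fi
    done
}

# Show the resulting command line instead of running it.
# Assumption: no value contains whitespace, so word splitting is harmless.
echo mpirun $(scm_env_args) -np 2 foo.x
```

Whether a given mpirun accepts NAME=value words in this position
depends on the MPI implementation, so check the output against the
documentation of the installed mpirun before using it.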

In the next version, we've implemented a permanent solution. All
necessary environment variables can be passed to ADF on the command line
as -DNAME=value. We hope this solution will work in all cases.
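
To illustrate the -DNAME=value form only, here is a small shell sketch
that turns selected environment variables into such words. The helper
name scm_define_args is invented for this example, and how the words
are appended to the actual ADF command line is not shown because it
depends on the release in question.

```shell
#!/bin/sh
# Hypothetical helper: print a -DNAME=value word for each named
# variable that is set and non-empty.
# Usage: scm_define_args NAME [NAME...]
scm_define_args() {
    for name in "$@"; do
        # Indirect lookup; callers must pass plain variable names only.
        eval "value=\${$name-}"
        if [ -n "$value" ]; then
            printf '%s\n' "-D$name=$value"
        fi
    done
}

# Example: list the words for three of the variables mentioned above.
scm_define_args SCM_TMPDIR SCM_USETMPDIR SCMLICENSE
```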

Best regards,
Alexei

Y. Huang wrote:

>Hello,
>
>I am running ADF jobs on a dual Opteron cluster connected by Myrinet
>interconnect. OS is SuSE. MPICH-GM has been installed on the system. I
>have set the following environment variables in .bashrc
>
>export ADFHOME=/home/huang/software/ADF/adf2004.01
>export ADFBIN=$ADFHOME/bin
>export ADFRESOURCES=$ADFHOME/atomicdata
>export SCMLICENSE=$ADFHOME/license
>export MPIDIR=/opt/mpigm
>export SCM_MACHINEFILE=/home/huang/nodes
>export NSCM=4
>export SCM_TMPDIR=/tmp
>export SCM_USETMPDIR=yes
>export P4_RSHCOMMAND=/usr/bin/ssh
>
>The file /home/huang/nodes contains 4 nodes.
>
>I got the following errors when I ran the example that comes with the ADF
>package, in the directory examples/adf/e_AIM_HF:
>
>...
>/bin/cp: omitting directory `/dev'
>/bin/cp: omitting directory `/etc'
>/bin/cp: omitting directory `/home'
>/bin/cp: omitting directory `/lib'
>/bin/cp: omitting directory `/lib64'
>/bin/cp: omitting directory `/lost+found'
>/bin/cp: omitting directory `/media'
>/bin/cp: omitting directory `/mnt'
>/bin/cp: omitting directory `/opt'
>/bin/cp: omitting directory `/opt.orig'
>/bin/cp: omitting directory `/proc'
>/bin/cp: omitting directory `/root'
>/bin/cp: omitting directory `/sbin'
>/bin/cp: omitting directory `/srv'
>/bin/cp: omitting directory `/sys'
>/bin/cp: omitting directory `/tmp'
>/bin/cp: omitting directory `/usr'
>/bin/cp: omitting directory `/var'
>...
> <Aug15-2005> <11:56:10> ADF 2004.01 RunTime: Aug15-2005 11:56:10
> <Aug15-2005> <11:56:10> *** (NO TITLE) ***
> <Aug15-2005> <11:56:10> RunType : SINGLE POINT
> <Aug15-2005> <11:56:10> NO ATOMS INPUT
> <Aug15-2005> <11:56:10> ardel: last array to be deleted not found
> <Aug15-2005> <11:56:10> NO ATOMS INPUT
> <Aug15-2005> <11:56:10> END
>ERROR DETECTED
> ************* Input file not found *************
>KFEXIT
>Contents of rdt21.res:
>cat: rdt21.res: No such file or directory
>Contents of WFN-alpha:
>cat: WFN-alpha: No such file or directory
>Contents of WFN-beta:
>cat: WFN-beta: No such file or directory
>
>
>Any suggestion is appreciated.
>
>Yiye
>
>**********************************
> Faculty of Science
> University of Waterloo
> Waterloo, Ontario N2L 3G1
> E-mail: huang_at_uwaterloo.ca
> Tel: (519) 888-4567 ext.6110
>**********************************
>http://hpc.uwaterloo.ca

Received on 2005-08-16 10:08:15
