Parallel runs with ADF

From: Kris Harris <kjharris_at_email.domain.hidden>
Date: Tue, 5 Mar 2002 23:13:37 +0000 (UTC)

I've narrowed the problem down.
First I downgraded to pvm 3.4.2 under Linux to ensure compatibility with
the stock binaries, and recompiled adf on the RS/6000 with PVM_ROOT and
PVM_ARCH set correctly, in case I did not do this in the correct order the
first time. (The RS/6000 is still on pvm 3.4.4, but the pvm documentation
says the two versions should communicate properly.)

Again, the pvm tests work, and running adf with -d -n2 splits jobs into two
just fine on either machine. I can rsh to either from either and run adfs
fine.

First a description of the setup:
RS/6000 named cynthia
Linux redhat 7.1 named rabi

If I put a line:
echo randomwordlist > ~/somefile.txt
at the top of both adfs scripts
and run a job with adf -n2 on cynthia, the word randomwordlist is never
written to ~/somefile.txt on either box, meaning that the adfs script is
never called.
rabi is looking for adfs in the directory given by the variable
SCMSPAWNSCRIPT as it is set on cynthia.
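
A slightly richer version of the echo marker above could record which host
ran the script and what it saw in its environment. This is only a sketch: it
assumes adfs is a Bourne-shell script, and the function name trace_spawn and
the log file name adfs_trace.txt are made up for illustration, not part of
ADF.

```shell
#!/bin/sh
# Hypothetical tracing helper, meant to be pasted near the top of an adfs
# script. trace_spawn and the default log name are illustrative only.
trace_spawn() {
    # $1: file to append to (defaults to a file in the home directory)
    trace_log=${1:-"$HOME/adfs_trace.txt"}
    {
        echo "host=$(uname -n)"
        echo "PROG=${PROG:-<unset>}"
        echo "SCMSPAWNSCRIPT=${SCMSPAWNSCRIPT:-<unset>}"
        echo "TEMPDIR=${TEMPDIR:-<unset>}"
    } >> "$trace_log"
}
```

Calling trace_spawn as the first command in adfs on both machines would show
whether the script ran at all and, if it did, on which host and with which
values.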

If I then set this variable (on cynthia) to point to where the adfs script
is located on rabi
and add the line:
echo $PROG > ~/prog.txt
and then start a job on cynthia as adf -n2,
I get the file prog.txt (on rabi) containing the location of adf.exe
on cynthia,
and the word randomwordlist in somefile.txt (on rabi).
This means that the variable $PROG is also not being set for the box the
script is to run on.
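
One way to cope with an inherited $PROG that points at the other machine's
path might be to check it on the host where the script actually runs and
fall back to a local path. The sketch below assumes adfs is a Bourne-shell
script; resolve_prog is a hypothetical helper, and the local path would have
to be supplied per machine.

```shell
#!/bin/sh
# Hypothetical helper for an adfs script: use the inherited path only if it
# is an executable file on this host, otherwise fall back to a local path.
resolve_prog() {
    # $1: path inherited from the launching host (may not exist here)
    # $2: path to the binary on this host (an assumption; set per machine)
    if [ -f "$1" ] && [ -x "$1" ]; then
        echo "$1"
    elif [ -f "$2" ] && [ -x "$2" ]; then
        echo "$2"
    else
        echo "no executable adf binary found on $(uname -n)" >&2
        return 1
    fi
}
```

In the script this could be used as PROG=$(resolve_prog "$PROG" "$LOCAL_ADF")
before the line that runs $PROG, where LOCAL_ADF is an invented variable
holding the location of adf.exe on that node. Whether this avoids the crash
described below is not known.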

If I then replace the line:
$PROG << eor > /dev/null
with:
/dirs/locationofadf.exeonrabi <<eor > /dev/null
adf starts on each machine, and some tape and log files are produced by
the parent and the kid, but the job crashes after running for a while
with:
 STOP RECEIVED from 524323 , tag= 9491
 <Mar05-2002> <18:50:40> WARNING: not all scratch files were closed
 <Mar05-2002> <18:50:40> END

and the same warning about not all scratch files being closed appears in
$TEMPDIR/kid#/logfile (interestingly, the variable $TEMPDIR is being set
properly).
I assume the program loses track of things because of my tinkering
with the adfs script.

Any ideas about what to do? The variables all seem to be set correctly
(rsh eitherhost env gives the expected results), and it seems they must
be if adf -d -n2 splits jobs into two properly. The problem must be in the
passing of variables between computers, but I don't know how this is done
in this program. I suppose it might work to put all the programs in the
same directories on all nodes, but that seems a little cheap.
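
To make the rsh env check easier to compare between the two machines, the
output could be reduced to the variables mentioned in this message. This is
a sketch only: pick_vars is a hypothetical name, and the variable list is
just the set named above, not a complete list of what ADF reads.

```shell
#!/bin/sh
# Hypothetical filter: keep only the variables discussed in this message from
# "env" output read on stdin, sorted so two hosts can be compared with diff.
pick_vars() {
    grep -E '^(PVM_ROOT|PVM_ARCH|SCMSPAWNSCRIPT|TEMPDIR|PROG)=' | sort
}

# Example use (not run here; host names are the ones from this message):
#   rsh cynthia env | pick_vars > cynthia.env
#   rsh rabi    env | pick_vars > rabi.env
#   diff cynthia.env rabi.env
```

Note that rsh host env reports the environment of a non-interactive remote
shell, which may differ from what a login session on that host shows.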

I would appreciate any help in this matter.
Kris Harris

kjharris_at_sdf.lonestar.org
SDF Public Access UNIX System - http://sdf.lonestar.org
Received on 2002-03-06 00:28:11

This archive was generated by hypermail 2.2.0 : 2006-11-02 07:00:02 CET