Dear Giacomo,
The problem might be that, even though the process was started on 64
CPUs, the NSCM environment variable was set to "1". When adf or another
program from the adf package finds NSCM=1, it shuts down all slave
processes and only the master continues to run. This is done only on the
IBM platform with MPI, because in this configuration the user has little
control over how many processors an MPI program runs on.
Running adf in parallel is not always desirable, for example, in
'create' runs.
To avoid the problem you described, set NSCM to the number of requested
processors in your script and make sure you do NOT set it to 1 in
.cshrc, .bashrc or any other shell resource file.
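As a rough sketch of the above (not an official recipe), the lines below
set NSCM to 64, matching total_tasks in your job, and then list any
mention of NSCM in the usual shell resource files. It is written in
Bourne shell syntax; since your job declares /bin/tcsh, the equivalent
line there would be "setenv NSCM 64". The list of resource files is only
an assumption about where a stray setting might live.

```shell
#!/bin/sh
# Sketch only: 64 is assumed to match "total_tasks = 64" in the job file.
# In a tcsh job script the equivalent line is:  setenv NSCM 64
NSCM=64
export NSCM

# Report any mention of NSCM in common shell resource files, so that a
# leftover "NSCM=1" (or "setenv NSCM 1") can be found and removed.
# The file list is an assumption; add any other resource files you use.
for rc in "$HOME/.cshrc" "$HOME/.tcshrc" "$HOME/.bashrc" "$HOME/.profile"; do
    if [ -f "$rc" ] && grep -q 'NSCM' "$rc"; then
        echo "NSCM is mentioned in $rc:"
        grep -n 'NSCM' "$rc"
    fi
done

echo "NSCM is now set to $NSCM"
```

These lines would go before the $ADFBIN/cpl call in the job script, so
that the program sees the intended value when it starts.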
Hope this helps,
Alexei Yakovlev
Giacomo Saielli wrote:
> Dear ADF developers and users,
> I tried to run a parallel job of cpl.exe on 64 procs (IBM Sp5); the
> input is below.
> However, it seems that the program used only 1 processor to calculate
> the integrals and spent the whole 24-hour slot, which was meant for
> the full calculation, in the integral section.
> In the output I can read the following lines:
>
> ...etc...
> ...etc...
> 62 sp021 -1 62
> 63 sp021 -1 63
>
> Communication Options:
> ----------------------
> Broadcast: 4
> Gather : 4
> Combine : 4
>
>
> ... continue to run with only one process active ...
>
> 1(INPUT FILE)
> maxmemoryusage 800
> nmrcoupling
> dso
> .. etc....
> .. etc...
>
> When it says "... continue to run with only one process active ...",
> does it really mean that it is using only one processor? The system
> administrator suggests that the problem might be related to a low
> memory request in the input.
> Does anybody know how to solve the problem so that the job runs on
> 64 processors instead of just one?
>
> Thank you very much for your kind assistance,
>
> --- INPUT-----
> # @ wall_clock_limit = 24:00:00
> # @ network.MPI = csss,shared,US
> # @ total_tasks = 64
> # @ blocking = unlimited
> # @ resources = ConsumableCpus(1) ConsumableMemory(1500 mb)
> # @ shell = /bin/tcsh
> # @ class = largepar
> # @ job_type = parallel
> # @ output = job.log
> # @ error = job.err
> # @ queue
>
> $ADFBIN/cpl <<eor >cpl.out
> maxmemoryusage 800
> nmrcoupling
> dso
> pso
> fc
> sd
> scf convergence 1e-5
> nuclei 1 2 9 end
> endinput
> eor
>
>
> --
> Giacomo Saielli
> Istituto per la Tecnologia delle Membrane del CNR, Sezione di Padova;
> Dipartimento di Scienze Chimiche, Via Marzolo, 1 - 35131 Padova,
> Italy; Tel: +39-049-8275279; Fax: -5239;
> http://www.chfi.unipd.it/home/g.saielli/
>
Received on 2006-02-14 14:36:10