[ADF-LIST] ADF2013 parallel failure on single node : mmap of rcvhdrq failed: Resource temporarily unavailable

Reuti reuti at staff.uni-marburg.de
Tue Sep 24 11:05:09 CEST 2013

Hi,

Am 24.09.2013 um 09:57 schrieb Edrisse Chermak:

> Dear all,
> Thanks a lot for your suggestions :
> 
> Dear Hans :
> ===========
> Unfortunately I don't have any start.exe file in my $ADFBIN (or anywhere
> else); typing "start" gives:
> --------------------------------------------------------------
> Executable ./start.exe could not be found
> Make sure that ./start.exe is in the same location as ./start
> For help: ./start -h
> --------------------------------------------------------------
> Could you please tell me how to get it, if it is generic (x86_64)?
> 
> Dear Reuti & Alexei:
> ====================
> I applied the change in qconf -mconf.
> I also added "export MPI_REMSH=rsh" and "export MPI_TMPDIR=$TMPDIR" in
> the job script.
> I also set up the LD_LIBRARY_PATH and other paths so that they match
> the built-in mpirun of ADF. Now in the error log I get:
> 
> ------------------------------
> mpirun: rsh: Command not found
> ------------------------------
> 
> If I remove the "export MPI_REMSH=rsh" line, I get:

Good, then it's trying to use a Tight Integration into SGE, where all processes are under the control of SGE. A properly defined PE should create a wrapper which maps `rsh` to `qrsh -inherit ...`. Which PE are you requesting for your job, and what is the output of:

$ qconf -sp foobar

where "foobar" is the name of the requested PE.
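
To pick out the two settings that matter for this question, something like the following could be used. It is only a sketch: `pe_summary` is a made-up helper name, and it assumes the usual "key value" layout of the `qconf -sp` output with the standard `control_slaves` and `allocation_rule` fields.

```shell
# Hypothetical helper: summarize the PE settings relevant to Tight Integration.
# Reads the output of `qconf -sp <pe>` on stdin.
pe_summary() {
    awk '
        $1 == "control_slaves"  { cs = $2 }
        $1 == "allocation_rule" { ar = $2 }
        END {
            printf "control_slaves=%s allocation_rule=%s\n", cs, ar
            # Tight Integration needs control_slaves TRUE; otherwise
            # `qrsh -inherit` calls from inside the job are refused.
            if (toupper(cs) != "TRUE")
                print "note: qrsh -inherit is refused unless control_slaves is TRUE"
            # $pe_slots keeps every slot of the job on one host.
            if (ar == "$pe_slots")
                print "note: all slots are placed on a single host"
        }
    '
}

# Usage on a submit host ("foobar" is the placeholder PE name from above):
#   qconf -sp foobar | pe_summary
```

An `allocation_rule` of `$pe_slots` means the job stays on one node; any other rule can spread the job across hosts, and then the `rsh` wrapper becomes necessary.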

-- Reuti

NB: The PE definition will also show whether you are solely on one node or starting a job across nodes. As long as it's local, there is usually no need to have access to the `qrsh -inherit ...` wrapper (or I didn't notice it before). My clusters have no `ssh` or `rsh` implemented (except for admin staff), hence all calls between nodes need to be started by SGE.
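
For illustration, the core of such a wrapper can be sketched as below. This is a simplified stand-in, not the script a real PE installs: the function name `rsh_wrapper` and the `QRSH` override variable are assumptions made for this example, and a production wrapper (Grid Engine installations typically ship one under $SGE_ROOT/mpi) also has to handle `rsh` options such as `-n` and `-l`.

```shell
# Hypothetical minimal `rsh` replacement: forward the remote command to
# `qrsh -inherit`, so the process starts under SGE's control.
# Usage: rsh_wrapper <host> <command> [args...]
rsh_wrapper() {
    if [ "$#" -lt 2 ]; then
        echo "usage: rsh_wrapper <host> <command> [args...]" >&2
        return 2
    fi
    host=$1
    shift
    # QRSH is an illustration-only override so the mapping can be tried
    # as a dry run on a machine without SGE; it defaults to the real qrsh.
    "${QRSH:-qrsh}" -inherit "$host" "$@"
}

# Inside a running parallel job this would start `hostname` on c1bay5:
#   rsh_wrapper c1bay5 hostname
```

In a Tight Integration setup the PE's start procedure puts a script like this, named `rsh`, at the front of the job's PATH, which is why "mpirun: rsh: Command not found" points to the wrapper not having been created.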


> ----------------------------------------------------------------------
> c1bay5.0ipath_userinit: mmap of pio buffers at 100000 failed: Resource
> temporarily unavailable
> ----------------------------------------------------------------------
> and :
> ----------------------------------------------------------------------
> c1bay5.0Driver initialization failure on /dev/ipath (err=23)
> adf.exe: Rank 0:0: MPI_Init: psm_ep_open() failed
> adf.exe: Rank 0:0: MPI_Init: Can't initialize RDMA device
> adf.exe: Rank 0:0: MPI_Init: Internal Error: Cannot initialize RDMA
> protocol
> MPI Application rank 0 exited before MPI_Init() with status 1
> mpirun: Broken pipe
> ----------------------------------------------------------------------
> 
> Sorry for the long post & thanks for your help,
> Edrisse
> 
> On 09/23/2013 11:17 AM, Hans van Schoot wrote:
>> Dear Edrisse Chermak,
>> 
>> This issue is most likely related to MPI; it looks like your local MPI
>> installation is overruling the one used to build ADF.
>> Could you send $ADFBIN/start to support at scm.com, and describe how you
>> start your job?
>> 
>> best regards,
>> Hans van Schoot
>> 
>> 
>> On 22-09-13 21:11, Edrisse Chermak wrote:
>>> Dear ADF developers and users,
>>> 
>>> I can run ADF in serial mode without problems, but when I ask for more
>>> than 1 CPU it crashes, saying:
>>> ***********************************************************
>>> mmap of rcvhdrq failed: Resource temporarily unavailable
>>> ***********************************************************
>>> and also:
>>> ***********************************************************
>>> Mismatched user minor version (12) and driver minor version (11) while
>>> context sharing. Ensure that driver and library are from the same
>>> release.
>>> **********************************************************
>>> 
>>> I attached the (normal) output with 1 CPU and the one with error logs
>>> for more than 1 CPU. Do you have any idea where the problem comes from?
>>> 
>>> Note 1: I'm using Grid Engine 2011.1 as the queuing system.
>>> Note 2: I tried to relax the memory-lock limits in
>>> /etc/security/limits.conf, but that didn't solve the issue.
>>> 
>>> Thanks in advance for your kind advice,
>>> Best Regards,
>>> Edrisse
>>> 
>>> ________________________________
>>> 
>>> This message and its contents including attachments are intended
>>> solely for the original recipient. If you are not the intended
>>> recipient or have received this message in error, please notify me
>>> immediately and delete this message from your computer system. Any
>>> unauthorized use or distribution is prohibited. Please consider the
>>> environment before printing this email.
>>> 
>>> 
>>> _______________________________________________
>>> ADFlist mailing list
>>> ADFlist at scm.com
>>> http://lists.scm.com/mailman/listinfo/adflist
>> 
> 


