[ADF-LIST] ADF2013 parallel failure on single node : mmap of rcvhdrq failed: Resource temporarily unavailable

Alexei Yakovlev yakovlev at scm.com
Tue Sep 24 10:24:47 CEST 2013

Dear Edrisse,

The $ADFBIN/start file is a shell script that takes care of starting
(serial or parallel) ADF binary executables. Since you did not change
the file, there is no need to send it.

About the root problem: it looks like a Qlogic PSM and pinned memory issue.

Please follow the suggestions in the "Lock Enough Memory on Nodes when
Using SLURM" section of the Qlogic manual at
http://filedownloads.qlogic.com/files/manual/68842/IB6054601-00G.pdf
(found via Google).
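
As a quick check of the locked-memory limit that a process actually sees, a
minimal Python sketch along these lines could be run both in an interactive
shell and inside a job submitted through the queuing system, since the two
can differ. The function names and the 1 GiB threshold are illustrative
assumptions, not values taken from the Qlogic manual or from ADF.

```python
import resource


def describe_limit(value):
    """Return a human-readable form of an rlimit value given in bytes."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def memlock_is_sufficient(soft, required_bytes):
    """True if the soft limit is unlimited or at least required_bytes."""
    return soft == resource.RLIM_INFINITY or soft >= required_bytes


if __name__ == "__main__":
    # Limits of the current process; a batch job inherits its limits from
    # the queuing system's daemon, not necessarily from the login shell.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("memlock soft limit:", describe_limit(soft))
    print("memlock hard limit:", describe_limit(hard))

    # Hypothetical threshold for illustration only (1 GiB).
    required = 1 << 30
    if not memlock_is_sufficient(soft, required):
        print("locked-memory limit may be too low for PSM")
```

If the value reported inside the batch job is small while the interactive
shell reports "unlimited", the limit is probably being imposed by the
environment in which the queuing system starts jobs rather than by the
login configuration.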

After that, please check that you can run a simple MPI job over PSM using
the MPI bundled with the Qlogic software.

Kind regards,
Alexei

On 24/09/2013 09:57, Edrisse Chermak wrote:
> Dear all,
> Thanks a lot for your suggestions :
>
> Dear Hans :
> ===========
> Unfortunately I don't have any start.exe file in my $ADFBIN (or anywhere
> else). Typing "start" gives me:
> --------------------------------------------------------------
> Executable ./start.exe could not be found
> Make sure that ./start.exe is in the same location as ./start
> For help: ./start -h
> --------------------------------------------------------------
> Could you please tell me how to get it, if it is generic (x86_64)?
>
> Dear Reuti & Alexei:
> ====================
> I applied the change in qconf -mconf.
> I also added "export MPI_REMSH=rsh" and "export MPI_TMPDIR=$TMPDIR" in
> the job script.
> I also set up LD_LIBRARY_PATH and the other paths so that they match
> the built-in mpirun of ADF. Now in the error log I get:
>
> ------------------------------
> mpirun: rsh: Command not found
> ------------------------------
>
> If I remove the "export MPI_REMSH=rsh" line, I get:
>
> ----------------------------------------------------------------------
> c1bay5.0ipath_userinit: mmap of pio buffers at 100000 failed: Resource
> temporarily unavailable
> ----------------------------------------------------------------------
> and :
> ----------------------------------------------------------------------
> c1bay5.0Driver initialization failure on /dev/ipath (err=23)
>  adf.exe: Rank 0:0: MPI_Init: psm_ep_open() failed
>  adf.exe: Rank 0:0: MPI_Init: Can't initialize RDMA device
>  adf.exe: Rank 0:0: MPI_Init: Internal Error: Cannot initialize RDMA
> protocol
>  MPI Application rank 0 exited before MPI_Init() with status 1
>  mpirun: Broken pipe
> ----------------------------------------------------------------------
>
> Sorry for the long post & thanks for your help,
> Edrisse
>
> On 09/23/2013 11:17 AM, Hans van Schoot wrote:
>> Dear Edrisse Chermak,
>>
>> This issue is most likely related to MPI: it looks like your local MPI
>> installation is overriding the one used to build ADF.
>> Could you send $ADFBIN/start to support at scm.com and describe how you
>> start your job?
>>
>> best regards,
>> Hans van Schoot
>>
>>
>> On 22-09-13 21:11, Edrisse Chermak wrote:
>>> Dear ADF developers and users,
>>>
>>> I can run ADF in serial mode without problems, but when I ask for more
>>> than 1 CPU it crashes, saying:
>>> ***********************************************************
>>>  mmap of rcvhdrq failed: Resource temporarily unavailable
>>> ***********************************************************
>>> and also:
>>> ***********************************************************
>>> Mismatched user minor version (12) and driver minor version (11) while
>>> context sharing. Ensure that driver and library are from the same
>>> release.
>>> **********************************************************
>>>
>>> I attached the (normal) output with 1 CPU and the one with error logs
>>> for more than 1 CPU. Do you have any idea where the problem comes from?
>>>
>>> Note1 : I'm using Grid Engine 2011.1 as queuing system
>>> Note2 : I tried to relax the memory-lock limits in
>>> /etc/security/limits.conf, but that did not solve the issue.
>>>
>>> Thanks in advance for your kind advice,
>>> Best Regards,
>>> Edrisse
>>>
>>> ________________________________
>>>
>>> This message and its contents including attachments are intended
>>> solely for the original recipient. If you are not the intended
>>> recipient or have received this message in error, please notify me
>>> immediately and delete this message from your computer system. Any
>>> unauthorized use or distribution is prohibited. Please consider the
>>> environment before printing this email.
>>>
>>>
>>> _______________________________________________
>>> ADFlist mailing list
>>> ADFlist at scm.com
>>> http://lists.scm.com/mailman/listinfo/adflist
>>
