3.4. Job runners¶
Job runners have already been mentioned in previous chapters, in the parts regarding job running. The aim of this chapter is to sum up all that information and to introduce the various subclasses of JobRunner.
Job runners in PLAMS are very simple objects, both from the user's perspective and in terms of internal architecture. They have no methods that are meant to be explicitly called; they are just supposed to be created and passed to run() as parameters. The JobRunner class is concrete: it defines a basic local job runner and serves as a base class for further subclassing.
3.4.1. Local job runner¶
class JobRunner(parallel=False, maxjobs=0)[source]¶
Class representing a local job runner.
The goal of JobRunner is to take care of two important things – parallelization and runscript execution:

- When the run() method of any Job instance is executed, this method, after some preparations, passes control to a JobRunner instance. This JobRunner instance decides if a separate thread should be spawned for the job or if the execution should proceed in the main thread. The decision is based on the parallel attribute, which can be set on JobRunner creation. There are no separate classes for serial and parallel job runners; both cases are covered by JobRunner, depending on one boolean parameter.
- If the executed job is an instance of SingleJob, a shell script (called a runscript) is created which contains most of the actual computational work (usually it is just the execution of some external binary). The runscript is then submitted to the JobRunner instance using its call() method. This method executes the runscript in a separate subprocess and takes care of setting the proper working directory, output and error stream handling, etc.
The number of simultaneously running call() methods can be limited using the maxjobs parameter. If maxjobs is 0, no limit is enforced. If parallel is False, maxjobs is ignored. If parallel is True and maxjobs is a positive integer, a BoundedSemaphore of that size is used to limit the number of concurrently running call() methods.

A JobRunner instance can be passed to run() with the keyword argument jobrunner. If this argument is omitted, the instance stored in config.default_jobrunner is used.
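The interplay of parallel and maxjobs can be illustrated with a short, self-contained sketch using Python's threading.BoundedSemaphore. This is an illustration of the mechanism only, not the actual PLAMS code; all names below are made up:

```python
import threading
import time

MAXJOBS = 2  # corresponds to the maxjobs parameter
semaphore = threading.BoundedSemaphore(MAXJOBS)
lock = threading.Lock()
running = 0
peak = 0

def fake_call():
    """Stand-in for a runscript execution guarded by the semaphore."""
    global running, peak
    with semaphore:  # blocks while MAXJOBS calls are already active
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.05)  # simulate the actual work
        with lock:
            running -= 1

threads = [threading.Thread(target=fake_call) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak)  # never exceeds MAXJOBS
```

Even though six "jobs" are started, at most two are ever inside the guarded section at the same time.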
call(runscript, workdir, out, err, **kwargs)[source]¶
Execute the runscript in the folder workdir. Redirect output and error streams to out and err, respectively.

The arguments mentioned above should be strings containing paths to the corresponding files or folders. Other keyword arguments are ignored here, but they can be useful in JobRunner subclasses (see GridRunner.call()).

Returns an integer value indicating the exit code returned by the execution of the runscript.

This method can be safely overridden in JobRunner subclasses. For example, in GridRunner it submits the runscript to a job scheduler instead of executing it locally.

Note
This method is used automatically during run() and should never be explicitly called in your script.
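What call() does internally can be approximated by the following self-contained sketch (illustrative only; local_call is a made-up name, not the PLAMS implementation):

```python
import os
import subprocess
import tempfile

def local_call(runscript, workdir, out, err):
    """Rough sketch of a local call(): run the runscript in *workdir*,
    redirect stdout/stderr to the given files, return the exit code."""
    with open(os.path.join(workdir, out), "w") as o, \
         open(os.path.join(workdir, err), "w") as e:
        proc = subprocess.run(["sh", runscript], cwd=workdir, stdout=o, stderr=e)
    return proc.returncode

# Minimal usage: a runscript that just echoes a line.
workdir = tempfile.mkdtemp()
script = os.path.join(workdir, "job.run")
with open(script, "w") as f:
    f.write("echo hello\n")
code = local_call(script, workdir, "job.out", "job.err")
print(code)  # 0 on success; job.out now contains 'hello'
```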
_run_job(job, jobmanager)[source]¶
This method aggregates the parts of run() that are supposed to be run in a separate thread in case of parallel job execution. It is wrapped with the _in_thread() decorator.

This method should not be overridden.
3.4.2. Remote job runner¶
class GridRunner(grid='auto', sleepstep=None, **kwargs)[source]¶
Subclass of JobRunner that submits the runscript to a job scheduler instead of executing it locally. Apart from two new keyword arguments (grid and sleepstep) and a different call() method, it behaves and is meant to be used just like a regular JobRunner.

There are many different job schedulers that are popular and widely used nowadays (for example TORQUE, SLURM, OGE). They usually use different commands for submitting jobs or checking the queue status. This class tries to build a common and flexible interface for all those tools. The idea is that the commands used to communicate with the job scheduler are not rigidly hard-coded, but taken dynamically from a Settings instance instead. Thanks to that, the user has almost full control over the behavior of GridRunner.

So the behavior of GridRunner is determined by the contents of the Settings instance stored in its settings attribute. This Settings instance can be supplied manually by the user or taken from a collection of predefined behaviors stored as branches of config.gridrunner. The adjustment is done via the grid parameter, which should be either a string or a Settings instance. If it is a string, it has to be a key occurring in config.gridrunner (or 'auto' for autodetection). For example, if grid='slurm' is passed, config.gridrunner.slurm is linked as settings. If grid is 'auto', the entries in config.gridrunner are tested one by one and the first one that works (i.e. whose submit command is present on your system) is chosen. When a Settings instance is passed, it gets plugged in directly as settings.

Currently two predefined job schedulers are available (see plams_defaults.py): slurm for SLURM and pbs for job schedulers following the PBS syntax (PBS, TORQUE, Oracle Grid Engine, etc.).

The Settings instance used for GridRunner should have the following structure:

- .output – flag for specifying the output file path.
- .error – flag for specifying the error file path.
- .workdir – flag for specifying the path to the working directory.
- .commands.submit – submit command.
- .commands.check – queue status check command.
- .commands.getid – function extracting the submitted job's ID from the output of the submit command.
- .commands.finished – function checking if the submitted job is finished. It should take a single string (the job's ID) and return a boolean.
- .commands.special. – branch storing definitions of special run() keyword arguments.
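Using a plain nested dictionary in place of a Settings instance, a minimal scheduler definition following this structure could look like the sketch below (values are illustrative and only loosely based on SLURM):

```python
# Plain-dict sketch of the structure described above; in PLAMS this would
# be a Settings instance (attribute access instead of item access).
slurm_like = {
    "output": "-o",
    "error": "-e",
    "workdir": "-D",
    "commands": {
        "submit": "sbatch",
        "check": "squeue -j ",
        # extract the job ID from the submit command's output,
        # e.g. "Submitted batch job 123456"
        "getid": lambda output: output.split()[-1],
    },
    "special": {"queue": "-p "},
}

print(slurm_like["commands"]["getid"]("Submitted batch job 123456"))
```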
See call() for more technical details and examples.

The sleepstep parameter defines how often the job is checked for being finished. It should be an integer value specifying how many seconds the interval between two checks should last. If None, the global default from config.sleepstep is copied.

Note
Usually job schedulers are configured in such a way that the output of your job is captured somewhere else and copied to the location indicated by the output flag only when the job is finished. Because of that, it is not possible to have a peek at your output while your job is running (for example, to see if your calculation is going well). This limitation can be worked around with [Job].settings.runscript.stdout_redirect. If set to True, the output redirection will not be handled by the job scheduler, but built into the runscript using shell redirection >. That forces the output file to be created directly in workdir and updated live as the job proceeds.
call(runscript, workdir, out, err, runflags, **kwargs)[source]¶
Submit the runscript to the job scheduler with workdir as the working directory. Redirect output and error streams to out and err, respectively. runflags stores submit command options.
The submit command has the following structure. Underscores denote spaces, parts in pointy brackets correspond to settings entries, parts in curly brackets to call() arguments, and square brackets contain optional parts:

<.commands.submit>_<.workdir>_{workdir}_<.error>_{err}[_<.output>_{out}][FLAGS]_{runscript}
The output part is added if out is not None. This is handled automatically based on the .runscript.stdout_redirect value in the job's settings.

The FLAGS part is built based on the runflags argument, which is a dictionary storing run() keyword arguments. For every (key, value) pair in runflags the string _-key_value is appended, unless key is a special key occurring in .commands.special.. In that case _<.commands.special.key>value is used (mind the lack of a space in between!). For example, the Settings instance defining the interaction with the SLURM job scheduler, stored in config.gridrunner.slurm, has the following entries:

.workdir = '-D'
.output = '-o'
.error = '-e'
.special.nodes = '-N '
.special.walltime = '-t '
.special.queue = '-p '
.commands.submit = 'sbatch'
.commands.check = 'squeue -j '
The submit command produced in such a case:

>>> gr = GridRunner(parallel=True, maxjobs=4, grid='slurm')
>>> j.run(jobrunner=gr, queue='short', nodes=2, J='something', O='')

will be:

sbatch -D {workdir} -e {err} -o {out} -p short -N 2 -J something -O {runscript}
In some job schedulers some flags don't have a short form with the -key value semantics. For example, in SLURM the flag --nodefile=value has the short form -F value, but the flag --export=value does not. One can still use such a flag via the special keys mechanism:

>>> gr = GridRunner(parallel=True, maxjobs=4, grid='slurm')
>>> gr.settings.special.export = '--export='
>>> j.run(jobrunner=gr, queue='short', export='value')

sbatch -D {workdir} -e {err} -o {out} -p short --export=value {runscript}
The submit command produced in the way explained above is then executed and the returned output is used to determine the submitted job's ID. The function stored in .commands.getid is used for that purpose; it should take one string (the whole output) and return a string with the job's ID.

Now the method waits for the job to finish. Every sleepstep seconds it queries the job scheduler using the following algorithm:

- If a key finished exists in .commands., it is used. It should be a function taking the job's ID and returning True or False.
- Otherwise, the string stored in .commands.check is concatenated with the job's ID (with no space in between) and the resulting command is executed. A nonzero exit status indicates that the job is no longer in the job scheduler, hence it is finished.
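As an illustration, hypothetical getid and finished functions for a SLURM-like scheduler could look as follows (these are stand-alone sketches, not the PLAMS defaults):

```python
import subprocess

def getid(output):
    """Extract the job ID from e.g. 'Submitted batch job 123456'."""
    return output.split()[-1]

def finished(jobid):
    """Return True when 'squeue -j <jobid>' exits with nonzero status,
    i.e. the job is no longer known to the scheduler."""
    result = subprocess.run("squeue -j " + jobid, shell=True,
                            capture_output=True, text=True)
    return result.returncode != 0

print(getid("Submitted batch job 123456"))  # '123456'
```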
Since it is difficult (on some systems even impossible) to automatically obtain the job's exit code, the returned value is always 0. From the run() perspective it means that a job executed with GridRunner is never marked as crashed.

Note
This method is used automatically during run() and should never be explicitly called in your script.
_autodetect()[source]¶
Try to autodetect the type of job scheduler.
The autodetection mechanism is very simple. For each entry in config.gridrunner the submit command followed by --version is executed (for example qsub --version). If the execution is successful (indicated by exit code 0), a job scheduler of the corresponding type is present on the system and it is chosen. So if multiple different job schedulers are installed, only one is picked: the one whose "name" (indicated by the key in config.gridrunner) comes first in lexicographical order.

The returned value is one of the config.gridrunner branches. If the autodetection was not successful, an exception is raised.
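The loop described above can be sketched in stand-alone form (autodetect here is a made-up helper working on plain dictionaries; the real method operates on config.gridrunner and Settings objects):

```python
import subprocess

def autodetect(gridrunner_settings):
    """Return the name of the first scheduler (in lexicographical order)
    whose submit command answers '--version' with exit code 0."""
    for name in sorted(gridrunner_settings):
        submit = gridrunner_settings[name]["commands"]["submit"]
        try:
            proc = subprocess.run([submit, "--version"], capture_output=True)
        except FileNotFoundError:
            continue  # submit command not present on this system
        if proc.returncode == 0:
            return name
    raise RuntimeError("No supported job scheduler found")

# On a machine with no scheduler installed this raises RuntimeError; with
# both 'pbs' (qsub) and 'slurm' (sbatch) present it would return 'pbs'.
```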