Job manager

Job manager is the “commander” of PLAMS environment. It creates the structure of the working folder, manages its contents, and keeps track of all jobs you run.

Every instance of JobManager is tied to a working folder. This folder is created when JobManager instance is initialized and all the jobs managed by that instance have their job folders inside the working folder. You should not change job manager’s working folder after it has been created.

When you initialize PLAMS environment with the init() function, an instance of JobManager is created and stored in config.default_jobmanager This instance is tied to the main working folder (see The launch script for details) and used as a default every time some interaction with a job manager is required. In a normal situation you would never explicitly interact with a JobManager instance (create it manually, call any of its methods, explore its data etc.). All interactions are handled automatically from run() or other methods.

Technical

Usually there is no need to use any other job manager than the default one. Splitting work between multiple instances of JobManager may lead to some problems (different instances don’t communicate, so Rerun prevention does not work properly).

However, it is possible to manually create another instance of JobManager (with a different working folder) and use it for part of your jobs (by passing it as jobmanager keyword argument to run()). If you decide to do so, make sure to pass all instances of JobManager you manually created to finish() (as a list).

An example application for that could be running jobs within your script on many different machines (for example via SSH) and having a separate JobManager on each of them.

Rerun prevention

In some situations, for example when running many automatically generated small jobs, it may happen that two or more jobs are identical – they have the same input files. PLAMS has a built-in mechanism to detect such situations and avoid unnecessary work.

During run(), just before the actual job execution, a unique identifier (called hash) of a job is calculated. Job manager stores all hashes of previously started jobs and checks if the hash of the job you are just running has already occurred. If such a situation is detected, the execution of the current job is skipped and results of the previous job are used. Results from previous job’s folder can be either copied or linked to the current job’s folder, based on link_files key in previous job’s settings.

Note

Linking is done using hard links. Windows does not support hard links so if you are running PLAMS under Windows results are always copied.

The crucial part of the whole rerun prevention logic is a properly working hash() function. It is a function that takes the whole job instance and produces its hash. The hashing function needs to produce different hashes for different jobs and exactly the same hashes for jobs that do exactly the same work. It is far from trivial to come up with the scheme that works well for all kind of external binaries, since the technical details about job preparation can differ a lot. Currently implemented method works based on calculating SHA256 hash of input and/or runscript contents. The value of hashing key in job manager’s settings can be one of the following: 'input', 'runscript', 'input+runscript' (or None to disable the rerun prevention).

If you decide to implement your own hashing method, it can be done by overriding hash_input() and/or meth:~scm.plams.core.basejob.SingleJob.hash_runscript.

Warning

It may happen that two jobs with the same input and runscript files correspond to different jobs (for example, if they rely on some external file that is supplied using relative path). Sometimes it’s even a desired behavior to run multiple different copies of the same job (for example, multiple MD trajectories with the same starting point and random initial velocities). If you are experiencing problems (PLAMS refuses to run a job, becasue it was already run in the past), you can disable the rerun prevention with config.default_jobmanager.settings.hashing = None.

Hashing is disabled for MultiJob instances since they don’t have inputs and runscripts. Of course single jobs that are children of multijobs are hashed in the normal way, so trying to run exactly the same multijob twice will not trigger rerun prevention on the multijob level, but rather for every children job separately, effectively preventing any doubled work.

Pickling

The lifetime of the whole PLAMS environment is limited to a single script. That means every PLAMS script you run uses its own independent job manager, working folder and config settings. These objects are initialized at the beginning of the script with init() command and they cease to exist when the script ends. Also all the settings adjustments (apart from those done by editing Defaults file) are local and they affect only the current script.

As a consequence of that, the JobManager of the current script is not aware of any jobs that had been run in past scripts. But often it would be very useful to import a previously run job to the current script and use its results or build some new jobs based on it. For that purpose PLAMS offers data preservation for job objects. Every time an execution of a job successfully finishes (see Running a job), the whole job object is saved to a .dill file using Python serialization called pickling. Such a .dill file can be loaded (“unpickled”) in future scripts using load() function:

oldjob = load('/home/user/science/plams_workdir/myjob/myjob.dill')

This operation brings back the old Job instance in (almost) the same state it was just after its execution finished.

Note

The default Python pickling package pickle is not powerful enough to handle some of common PLAMS objects. Fortunately, the dill package provides an excellent replacement for pickle, following the same interface and being able to save and load almost everything. It is strongly recommended to use dill to ensure proper work of PLAMS data preservation logic. However, if dill is not installed for the Python interpreter you’re using to run PLAMS, the regular pickle package will be used instead (which can work if your Job objects are not too fancy, but in most cases it will probably fail). Please use dill, it’s free, easy to get and awesome.

The pickling mechanism follows references in pickled object. That means if an object you are trying to pickle contains a reference to another object (just like a Job instance has a reference to a Results instance), that other object is saved too. Thanks to that there are no “empty” references in your objects after unpickling. However, every Job instance in PLAMS has a reference to a job manager, which in turns has references to all other jobs, so pickling one job would effectively mean pickling the whole environment. To avoid that, every Job instance needs to be prepared for pickling by removing references to “global” objects, as well as some local attributes (path to the job folder for example). During loading, all the removed data is replaced with “proper” values (current job manager, current path to the job folder etc.).

Note

There is a way of expanding the mechanism explained above. If your Job object has an attribute with reference to an object you don’t want to save together with the job, you may add this object’s name to job’s _dont_pickle list:

myjob.something = some_big_and_clumsy_object_you_dont_want_to_pickle
myjob._dont_pickle.append('something')

That way big clumsy object will not be stored in the .dill file. After loading such a .dill file the value of myjob.something will simply be None.

The _dont_pickle list is an attribute of every Job instance, initially an empty list. It does not contain names of attributes that are always removed (like jobmanager), it’s meant only for additional ones defined by the user (see Job.__getstate__)

As mentioned above, pickling a job happens at the very end of run(). The decision if a job should be pickled is based on the pickle key in job’s settings, so it can be adjusted for each job separately. If you wish not to pickle a particular job just set myjob.settings.pickle = False. Of course the global default config.job.pickle can also be used.

If you modify a job or its corresponding Results instance after it has been pickler, these changes are not going to be reflected in the .dill file, since it was created before the changes happened. To update the state of the .dill file to include such changes you need to repickle the job manually by calling myjob.pickle() after doing your changes.

Note

Not all Python objects can be properly pickled, so you need to be careful what other objects your Job (or its Results) store references to.

The Results instance associated with the job is saved together with it. However, these results don’t contain all files produced by the job execution, but only relative paths to them. For that reason the .dill file is not enough to fully restore the job object if you want to extract or process the results. All other files present in the job folder are needed so that Results instance can see them. So if you want to copy a previously run job to another location make sure to copy the whole job folder (including subdirectories).

A job loaded with load() is not registered in the current job manager. That means it does not get its own subfolder in the current working folder, it never gets renamed and no Cleaning job folder is done on finish(). However, it is added to the hash registry, so it is visible to Rerun prevention.

In case of a MultiJob all the information about children jobs is stored in parent’s .dill file so loading a MultiJob results in loading all its children. Each child job can have its own .dill file containing information about that particular job only. When pickling, the parent attribute of a job is erased, so loading a child job does not result in loading its parent (and all other children).

Restarting scripts

Pickling and rerun prevention combine together into a handy restart mechanism. When your script tries to do something “illegal”, an exception is raised and the script gets terminated by the Python interpreter. Usually it is caused by a mistake in the script (a typo, using wrong variable, accessing wrong element of a list etc.). In such a case one would like to correct the script and run it again. But some jobs in the terminated script may had already been run and successfully finished before the exception occurred. It would be a waste of time to run those jobs again in the corrected script if they are meant to produce exactly the same results as previously. The solution is to load all successful jobs from the old script at the beginning of the new one and let Rerun prevention do the rest. But having to go to the old script’s working folder and manually get paths to all .dill files present there would be cumbersome. Fortunately, one can use load_all() function which takes a path to the main working folder of some finished PLAMS run and loads all .dill files present there. So when you edit your crashed script to remove mistakes you can add just one load_all() call at the beginning. Then you run your corrected script and no unnecessary work is done: all the finished jobs are loaded from the previous run, the current run tries to run the same jobs again, but Rerun prevention detects that and copies/links old jobs’ folders into the current main working folder.

If you’re executing your PLAMS scripts using the The launch script restarting is even easier. It can be done in two ways:

  1. If you wish to perform the restart run in a fresh, empty working folder, all you need to do is to import the contents of the previous working folder (from the crashed run) using -l flag:

    plams myscript.plms
    [17:28:40] PLAMS working folder: /home/user/plams_workdir
    #[crashed]
    #[correct myscript.plms]
    plams -l plams_workdir myscript.plms
    [17:35:44] PLAMS working folder: /home/user/plams_workdir.002
    

    This is eqivalent to putting load_all('plams_workdir') at the top of myscript.plms and running it with the usual plams myscript.plms.

  2. If you would prefer an in-place restart in the same working folder, you can use -r flag:

    plams myscript.plms
    [17:28:40] PLAMS working folder: /home/user/plams_workdir
    #[crashed]
    #[correct myscript.plms]
    plams -r myscript.plms
    [17:35:44] PLAMS working folder: /home/user/plams_workdir
    

    In this case the launch script will temporarily move all the contents of plams_workdir to plams_workdir.res, import all the jobs from there and start a regular run in now empty plams_workdir.

Note

Please remember that rerun prevention checks the hash of the job after the prerun() method is executed. So when you attempt to run a job identical to the one previously run (in the same script, or imported from a previous run), its prerun() method is executed anyway, even if the rest of Running a job is skipped.

API

class JobManager(settings, path=None, folder=None, use_existing_folder=False)[source]

Class responsible for jobs and files management.

Every instance has the following attributes:

  • foldername – the working folder name.

  • workdir – the absolute path to the working folder.

  • logfile – the absolute path to the logfile.

  • input – the absolute path to the copy of the input file in the working folder.

  • settings – a Settings instance for this job manager (see below).

  • jobs – a list of all jobs managed with this instance (in order of run() calls).

  • names – a dictionary with names of jobs. For each name an integer value is stored indicating how many jobs with that basename have already been run.

  • hashes – a dictionary working as a hash-table for jobs.

The path argument should be be a path to a directory inside which the main working folder will be created. If None, the directory from where the whole script was executed is used.

The foldername attribute is initially set to the folder argument. If such a folder already exists (and use_existing_folder is False), the suffix .002 is appended to folder and the number is increased (.003, .004…) until a non-existsing name is found. If folder is None, the name plams_workdir is used, followed by the same procedure to find a unique foldername.

The settings attribute is directly set to the value of settings argument (unlike in other classes where they are copied) and it should be a Settings instance with the following keys:

  • hashing – chosen hashing method (see Rerun prevention).

  • counter_len – length of number appended to the job name in case of a name conflict.

  • remove_empty_directories – if True, all empty subdirectories of the working folder are removed on finish().

__init__(settings, path=None, folder=None, use_existing_folder=False)[source]

Initialize self. See help(type(self)) for accurate signature.

load_job(filename)[source]

Load previously saved job from filename.

Filename should be a path to a .dill file in some job folder. A Job instance stored there is loaded and returned. All attributes of this instance removed before pickling are restored. That includes jobmanager, path (the absolute path to the folder containing filename is used) and default_settings (a list containing only config.job).

See Pickling for details.

remove_job(job)[source]

Remove job from the job manager. Forget its hash.

_register(job)[source]

Register the job. Register job’s name (rename if needed) and create the job folder.

If a job with the same name was already registered, job is renamed by appending consecutive integers. The number of digits in the appended number is defined by the counter_len value in settings. Note that jobs whose name already contains a counting suffix, e.g. myjob.002 will have the suffix stripped as the very first step.

_check_hash(job)[source]

Calculate the hash of job and, if it is not None, search previously run jobs for the same hash. If such a job is found, return it. Otherwise, return None

_clean()[source]

Clean all registered jobs according to the save parameter in their settings. If remove_empty_directories is True, traverse the working directory and delete all empty subdirectories.