4.5. Restarting an optimization

../../_images/LJ_Ar_restart_losses.png

You can continue an optimization from where a previous one stopped by using checkpoints. A checkpoint is a snapshot of the optimization at a point in time; it saves to disk the trajectory data of all living optimizers together with their internal states.

This tutorial will show you how to:

  1. set up an optimization with checkpointing;

  2. continue an optimization from a checkpoint file.

This example will use the same problem as Getting Started: Lennard-Jones.

The example files can be found in $AMSHOME/scripting/scm/params/examples/LJ_Ar_restart.

Make a copy of the example directory.

4.5.1. Setting up checkpointing

Start the ParAMS GUI: First open AMSjobs or AMSinput, and then select SCM → ParAMS
Select File → Open
Select any of the example files to load the optimization

To change checkpointing settings go to the checkpointing panel:

Click the Optimization panel button to open the optimization window
Select Details → Checkpoints

There are various points at which checkpoints can be made:

  • at the start of the optimization (before any optimizers have been spawned);

  • at the end of the optimization (after the exit condition has triggered but before any optimizers are stopped);

  • every n function evaluations;

  • once every n seconds.

You can use the above options in any combination.

Tip

By default, ParAMS saves a checkpoint every hour during an optimization.

We will make a checkpoint every 100 function evaluations:

Enter 100 next to Checkpoint interval (function evaluations)

We can also control how many checkpoints are kept.

By default only one checkpoint is kept, with newer ones overwriting older ones. You can retain more to restart further back in time.

We will keep 3 older checkpoints:

Enter 3 next to Number of older checkpoints to keep
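
If you drive ParAMS from Python rather than the GUI, the same checkpointing choices can be made on the job's input settings. The sketch below is only an illustration: the CheckpointControl block and key names are assumptions modeled on the GUI labels above, so check the params.in file written by the GUI (or the ParAMS input reference) for the exact spelling used by your version.

from scm.params import ParAMSJob

# Hypothetical sketch -- block and key names are assumptions based on the GUI labels
job = ParAMSJob.from_inputfile("params.in", name="initial")
job.settings.input.CheckpointControl.EveryFunctionEvaluations = 100  # assumed key: checkpoint every 100 evaluations
job.settings.input.CheckpointControl.KeepOlderCheckpoints = 3        # assumed key: keep 3 older checkpoints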

Important

The above is all that is needed to set up checkpointing.

The adjustments that follow should not normally be needed. We make them in this tutorial only to ensure the same sequence of results, for demonstration purposes.

We begin with the optimizer, for which we will set a numerical seed for the random number generator:

Important

The optimizers provided by ParAMS are generally compatible with checkpointing. The Scipy optimizer is the only exception due to limitations in the Scipy code.

Options → Optimizers
In the CMA-ES Settings block enter: seed 112

We also adjust the settings to make the optimization run completely serially:

Details → Technical
Set Loss function evaluations (per optimizer) to 1
Set Jobs (per loss function evaluation) to 1
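
The parallelism settings also have input counterparts. As a rough, non-authoritative sketch (the ParallelLevels block and its sub-key names are assumptions inferred from the GUI labels; verify them against your params.in or the Technical panel), the equivalent Python settings might look like this:

from scm.params import ParAMSJob

job = ParAMSJob.from_inputfile("params.in", name="initial")
# Assumed block/key names -- check your params.in for the exact spelling
job.settings.input.ParallelLevels.ParameterVectors = 1  # loss function evaluations per optimizer
job.settings.input.ParallelLevels.Jobs = 1              # jobs per loss function evaluation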

Finally, we adjust the logging frequency to capture every point:

Details → Output
Under Training Set Logger set General to 1

Save and run the optimization:

File → Save As with the name initial.params
File → Run

Results will start to appear on the Graphs panel. Black vertical lines are added to the loss graph to indicate the points at which checkpoints have been created:

../../_images/LJ_Ar_restart_checklines.png

Fig. 4.13 Loss trajectory of the optimizer with vertical black lines showing the points of saved checkpoints.

Notice that the lines do not align perfectly with the requested interval of 100 evaluations. This is because some buffer time is always needed to pause and synchronize the optimizers and to process all outstanding results.

Checkpoint files are saved in <jobname>.results/checkpoints. This folder also contains checkpoints.txt which lists the function evaluations corresponding to each of the checkpoints in the directory.
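
If you want to inspect the saved checkpoints outside of the GUI, a small Python sketch such as the one below will do. It relies only on the directory layout described above; the exact contents of checkpoints.txt may differ between versions, so treat it as an illustration.

from pathlib import Path

# Checkpoints of the job saved as "initial" above (run from the directory containing initial.results)
checkpoint_dir = Path("initial.results") / "checkpoints"

# List the checkpoint archives, oldest first
for archive in sorted(checkpoint_dir.glob("*.tar.gz"), key=lambda p: p.stat().st_mtime):
    print(archive.name)

# checkpoints.txt records the function evaluation at which each checkpoint was made
print((checkpoint_dir / "checkpoints.txt").read_text())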

4.5.2. Restarting from a checkpoint

Open the Optimization panel and select Main
Next to Resume checkpoint, use the file browser button to navigate to the earliest checkpoint (the file ending with .tar.gz) in <jobname>.results/checkpoints

This is all that is required to resume from a checkpoint.

For a literal restart, do not change any of the other options.

However, you may change any other settings you like; the optimizers will continue as normal, but they will operate within the new context.

There are times when this can be useful. Some examples:

  • changing stopping rules to make them more aggressive;

  • changing exit conditions to extend an optimization which has not yet converged;

  • changing training set weights / items to refocus an optimizer’s attention.

Warning

The only changes that are forbidden are those made to the parameter interface; these will result in an error.
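
As an illustration of the second point above, extending an optimization that has not yet converged, the Python API from section 4.5.3 could be used as sketched below. The job names and the 2000-evaluation budget are illustrative choices only; all of the calls also appear in the full script later in this tutorial.

from scm.plams import init, finish
from scm.params import ParAMSJob

init()

# Run the original optimization
job = ParAMSJob.from_inputfile("params.in", name="initial")
job.run()

# Resume from the latest checkpoint (assuming get_checkpoints() returns them oldest first)
restart_job = ParAMSJob.from_inputfile("params.in", name="extended")
restart_job.resume_checkpoint = job.results.get_checkpoints()[-1]

# Replace the original exit conditions with a larger evaluation budget
restart_job.settings.input.ExitCondition = []
restart_job.add_exit_condition("MaxTotalFunctionCalls", 2000)
restart_job.run()

finish()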

For demonstrative purposes we will reduce our exit condition to 400 function evaluations:

Set Max loss function calls to 400

Save and run:

Warning

When resuming from a checkpoint, you must save to a new job. Overwriting an existing job will result in an error.

File → Save As with the name resume.params
File → Run

Notice that, as the results appear on the Graphs panel, they start after evaluation 100. Also note that the optimization finishes, as expected, after 400 function evaluations.

../../_images/LJ_Ar_restart_resume.png

When resuming from a checkpoint, results from the previous optimization are not shown, but the evaluation numbers are unique and continue from the previous results, so the two runs can be linked.

For example, in the image below the running_loss.txt files produced by the initial run and the resumed run are plotted together. You can see a perfect overlap between them.

../../_images/LJ_Ar_restart_losses.png
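
If you prefer to make such a plot directly from the running_loss.txt files rather than through the Python API used in the next section, a rough sketch is given below. The file locations and column layout are assumptions (the files are searched for recursively, column 0 is taken as the evaluation number and the last column as the loss), so check the file headers before relying on it.

from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

for label, results_dir in [("Initial", "initial.results"), ("Resumed", "resume.results")]:
    # Assumption: use the first running_loss.txt found under the results directory
    loss_file = next(Path(results_dir).rglob("running_loss.txt"))
    data = np.loadtxt(loss_file, comments="#")
    # Assumption: column 0 = evaluation number, last column = loss value
    plt.plot(data[:, 0], np.log10(data[:, -1]), label=label)

plt.xlabel("Evaluation id")
plt.ylabel("log10(loss)")
plt.legend()
plt.show()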

4.5.3. Resume from checkpoint with Python

See run.py:

#!/usr/bin/env amspython
from scm.plams import *
from scm.params import *
import numpy as np
import os
import matplotlib.pyplot as plt

def main():
    init()

    inputfile = os.path.expandvars('$AMSHOME/scripting/scm/params/examples/LJ_Ar_restart/params.in')

    # Run the initial optimization as defined in the example input file
    job = ParAMSJob.from_inputfile(inputfile, name="initial")
    job.run()

    # The first entry of get_checkpoints() is the earliest saved checkpoint
    earliest_checkpoint_file = job.results.get_checkpoints()[0]

    # Set up a second job that resumes from that checkpoint,
    # replacing the original exit condition with a 400-evaluation budget
    restart_job = ParAMSJob.from_inputfile(inputfile, name="resume_from_checkpoint")
    restart_job.resume_checkpoint = earliest_checkpoint_file
    restart_job.settings.input.ExitCondition = []
    restart_job.add_exit_condition("MaxTotalFunctionCalls", 400)

    restart_job.run()

    # Collect the running loss of both runs and plot them on a log scale
    evaluation, loss = job.results.get_running_loss()
    restart_evaluation, restart_loss = restart_job.results.get_running_loss()

    plt.plot(evaluation, np.log10(loss), restart_evaluation, np.log10(restart_loss))
    plt.legend(["Initial", "Resume from Checkpoint"])
    plt.xlabel("Evaluation id")
    plt.ylabel("log10(loss)")
    plt.show()

    finish()

if __name__ == '__main__':
    main()
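
To run the script from the command line, use $AMSBIN/amspython run.py. It runs the initial optimization, resumes it from the earliest checkpoint with a 400-evaluation exit condition, and then plots the loss trajectories of the two runs on top of each other.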