3.7. Optimization
The Optimization class is where the other components (Job Collection, Data Set, Parameter Interfaces and Optimizers) come together.
It is responsible for the selection, generation, execution and evaluation of new jobs for every new parameter set.
See also
Architecture Quick Reference for an overview
An Optimization instance is usually initialized once every other component is defined:
>>> from scm.params import *
>>> interface = ReaxParams('path/to/ffield.ff')
>>> jc = JobCollection('path/to/jobcol.yml')
>>> training_set = DataSet('path/to/dataset.yml')
>>> optimizer = CMAOptimizer(popsize=15)
>>> optimization = Optimization(jc, training_set, interface, optimizer)
Once initialized, the following will run a complete optimization:
>>> optimization.optimize()
After instantiation, a summary of all relevant settings can be printed with summary():
>>> optimization.summary()
Optimization() Instance Settings:
=================================
Workdir: opt
JobCollection size: 20
Interface: ReaxParams
Active parameters: 207
Optimizer: CMAOptimizer
Callbacks: Timeout
Logger
Evaluators:
-----------
Name: trainingset (_LossEvaluator)
Loss: SSE
Evaluation frequency: 1
Data Set entries: 20
Data Set jobs: 20
Batch size: None
CPU cores: 6
Use PIPE: True
---
===
3.7.1. Optimization Setup
The optimization can be further controlled by providing a number of optional keyword arguments to the Optimization instance.
While the full list of arguments is documented in the API section below,
the most relevant ones are presented here.
- parallel
- An instance of the ParallelLevels class describing how the optimization is to be parallelized.
- constraints
- Constraints further restrict the parameter search space: every candidate solution is checked against them, and solutions that violate a constraint are not considered.
- callbacks
- A list of callback instances. Callbacks provide a versatile way to interact with the optimization process at every iteration.
- validation
- Fraction of the training_set entries (a float between 0 and 1) to be held out for validation. Can be used with the Early Stopping callback.
- loss
- The loss function to be used for this optimization instance.
- batch_size
- Instead of evaluating all properties in the training_set, evaluate at most batch_size randomly picked entries per iteration.
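As a rough illustration of two of the options above, the following sketch shows how constraints and batch_size could act on a candidate solution and on the training set entries. It does not use ParAMS itself: is_feasible and pick_batch are hypothetical helpers, constraints are modelled as plain callables returning a bool, and the entry names are made up.

```python
import random
from typing import Callable, List, Optional, Sequence

Constraint = Callable[[Sequence[float]], bool]  # hypothetical stand-in for a constraint


def is_feasible(x: Sequence[float], constraints: Sequence[Constraint]) -> bool:
    """A candidate is kept only if every constraint returns True."""
    return all(constraint(x) for constraint in constraints)


def pick_batch(entries: Sequence[str], batch_size: Optional[int],
               rng: random.Random) -> List[str]:
    """Return all entries if batch_size is None, else at most batch_size random ones."""
    if batch_size is None or batch_size >= len(entries):
        return list(entries)
    return rng.sample(list(entries), batch_size)


# Two made-up constraints on a two-parameter candidate.
def ordered(x: Sequence[float]) -> bool:
    return x[0] <= x[1]


def positive(x: Sequence[float]) -> bool:
    return all(value > 0 for value in x)


entries = [f"entry_{i}" for i in range(20)]  # placeholder data set entries
rng = random.Random(0)                       # seeded only to make the sketch repeatable
batch = pick_batch(entries, 5, rng)          # a new random subset per iteration
```

In this sketch a candidate such as [3.0, 2.0] is rejected because ordered returns False, while batch_size=None leaves the full list of entries untouched.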
3.7.2. Optimization API
- class Optimization(jobcollection: scm.params.core.jobcollection.JobCollection, datasets: Union[scm.params.core.dataset.DataSet, Sequence[scm.params.core.dataset.DataSet]], parameterinterface: Type[scm.params.parameterinterfaces.base.BaseParameters], optimizer: Type[scm.params.optimizers.base.BaseOptimizer], title: str = 'opt', plams_workdir_path: str = None, validation: float = None, callbacks: Sequence[scm.params.core.callbacks.Callback] = None, constraints: Sequence[scm.params.parameterinterfaces.base.Constraint] = None, parallel: scm.params.common.parallellevels.ParallelLevels = None, verbose: bool = True, skip_x0: bool = False, loss: Union[scm.params.core.lossfunctions.Loss, Sequence[scm.params.core.lossfunctions.Loss]] = 'sse', batch_size: Union[int, Sequence[int]] = None, use_pipe: Union[bool, Sequence[bool]] = True, dataset_names: Sequence[str] = None, eval_frequency: Union[int, Sequence[int]] = 1, maxjobs: Union[None, Sequence[int]] = None, maxjobs_shuffle: Union[bool, Sequence[bool]] = False)
- The top level class managing an entire optimization.
- __init__(jobcollection: scm.params.core.jobcollection.JobCollection, datasets: Union[scm.params.core.dataset.DataSet, Sequence[scm.params.core.dataset.DataSet]], parameterinterface: Type[scm.params.parameterinterfaces.base.BaseParameters], optimizer: Type[scm.params.optimizers.base.BaseOptimizer], title: str = 'opt', plams_workdir_path: str = None, validation: float = None, callbacks: Sequence[scm.params.core.callbacks.Callback] = None, constraints: Sequence[scm.params.parameterinterfaces.base.Constraint] = None, parallel: scm.params.common.parallellevels.ParallelLevels = None, verbose: bool = True, skip_x0: bool = False, loss: Union[scm.params.core.lossfunctions.Loss, Sequence[scm.params.core.lossfunctions.Loss]] = 'sse', batch_size: Union[int, Sequence[int]] = None, use_pipe: Union[bool, Sequence[bool]] = True, dataset_names: Sequence[str] = None, eval_frequency: Union[int, Sequence[int]] = 1, maxjobs: Union[None, Sequence[int]] = None, maxjobs_shuffle: Union[bool, Sequence[bool]] = False)
Parameters:
- jobcollection : JobCollection
- Job collection holding all jobs necessary to evaluate the datasets.
- datasets : DataSet, list(DataSet)
- Data Set(s) to be evaluated.
In the simplest case, one data set will be evaluated as the training set. Multiple data sets can be passed to be evaluated sequentially at every optimizer step. In this case, the first data set will be interpreted as the training set, the second as a validation set.
- parameterinterface : any parameter interface
- The interface to the parameters that are to be optimized.
- optimizer : optimizer class
- An instance of an optimizer class.
- title : optional, str
- The working directory for this optimization. Once optimize() is called, the process will switch to it.
- plams_workdir_path : optional, str
- The folder in which the PLAMS working directory is created. By default the PLAMS working directory is created inside of the working directory of the optimization, see the title keyword argument. When running on a compute cluster this variable can be set to a local directory of the machine where the jobs are running (e.g. $SCM_TMPDIR), avoiding a potentially slow PLAMS working directory that is mounted over the network.
- validation : optional, float, int
- If the passed value satisfies 0 < validation < 1, a validation set will be created from that fraction of the first data set in datasets.
If the passed value satisfies 1 < validation < len(datasets[0]), a validation set with validation entries will be taken from the first data set in datasets.
If you would like to pass a DataSet instance instead, you can do so in the datasets parameter.
- callbacks : optional, List of callback instances
- List of callbacks interacting with the optimization instance.
- constraints : optional, List of parameter constraints
- Additional constraints for candidate solutions of \(\boldsymbol{x}^*\). If any of these returns False, the solution will not be considered.
- parallel : optional, ParallelLevels
- Configuration for the parallelization at all levels of a parameter optimization.
- verbose : bool
- Print the current best loss function value each time it improves.
- skip_x0 : bool
- Before an optimization process starts, a DataSet will be evaluated with the initial parameters \(\boldsymbol{x}_0\).
If this initial evaluation returns an infinite loss function value, an error will be raised by default.
This behavior assumes that the initial parameters are generally valid and that a non-finite loss is most likely caused by bad plams.Settings of an entry in the JobCollection.
However, if it is not known whether the initial parameters can be trusted, or if raising an error is not desired for other reasons, this parameter can be set to True to skip the initial evaluation.
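The skip_x0 behavior described above can be pictured with a minimal sketch. It is not the ParAMS implementation: check_initial_loss, quadratic_loss and broken_loss are hypothetical names, the loss is modelled as a plain callable, and the choice of ValueError is an assumption made only for this example.

```python
import math
from typing import Callable, Optional, Sequence


def check_initial_loss(loss_fn: Callable[[Sequence[float]], float],
                       x0: Sequence[float],
                       skip_x0: bool = False) -> Optional[float]:
    """Evaluate the loss at x0 unless skipped; reject non-finite values."""
    if skip_x0:
        return None  # initial evaluation is skipped entirely
    loss = loss_fn(x0)
    if not math.isfinite(loss):  # catches inf, -inf and nan
        # Assumption: the real error type and message may differ.
        raise ValueError("Initial parameters produced a non-finite loss; "
                         "check the job settings or pass skip_x0=True.")
    return loss


def quadratic_loss(x: Sequence[float]) -> float:
    """Made-up loss standing in for a data set evaluation."""
    return sum(value ** 2 for value in x)


def broken_loss(x: Sequence[float]) -> float:
    """Mimics an evaluation in which a job failed."""
    return float("inf")
```

With these definitions, check_initial_loss(broken_loss, [1.0]) raises, whereas passing skip_x0=True returns None without evaluating anything.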
Per Data Set Parameters:
Note
The following parameters will be applied to all entries in datasets, meaning each Data Set will be evaluated with the same settings. To override this, any of the parameters below can also take a list with len(datasets) elements, mapping individual settings to every datasets entry.
- loss : optional, Loss, str
- A Loss Function instance to compute the loss of every new parameter set. Residual Sum of Squares by default.
- batch_size : optional, int
- The number of entries to be evaluated per epoch. If None, all entries will be evaluated.
Note: One job calculation can have multiple property entries in a training set (e.g. Energy and Forces), so this parameter is not the same as maxjobs.
Note: If both maxjobs and batch_size are set, the former will be applied first. If the resulting set is still larger than batch_size, it will be filtered further by batch_size.
- use_pipe : optional, bool
- Whether to use the AMSWorker interface for suitable jobs.
- dataset_names : optional, List[str]
- When evaluating multiple datasets, can be set to give each entry a name.
Logger callbacks, if present, will create a subdirectory with this name and write their data into it.
Defaults to ['trainingset', 'validationset', 'dataset03', ..., 'datasetXX']
- eval_frequency : optional, int
- Evaluate the Data Set at every eval_frequency call.
Warning
The first entry in datasets represents the training set and must be evaluated at every call. Its frequency will always be 1.
- maxjobs : optional, int
- Limit each Data Set evaluation to a subset of at most maxjobs jobs. Ignored if None.
- maxjobs_shuffle : optional, bool
- If True and maxjobs is set, a new subset of the Data Set with maxjobs jobs will be generated at every evaluation.
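The per data set rules above (a single value applied to every data set, an explicit list mapped one to one, the default names, and the training set always being evaluated) can be sketched as follows. All three helpers are hypothetical and only mirror the description in this section; in particular, counting calls from 1 and testing the frequency with a modulo are assumptions.

```python
from typing import Any, List, Sequence


def per_dataset(value: Any, n_datasets: int) -> List[Any]:
    """Expand one setting to all data sets, or check an explicit list."""
    if isinstance(value, (list, tuple)):
        if len(value) != n_datasets:
            raise ValueError("Expected one setting per data set.")
        return list(value)
    return [value] * n_datasets


def default_names(n_datasets: int) -> List[str]:
    """Names following the documented default pattern."""
    names = ["trainingset", "validationset"][:n_datasets]
    names += [f"dataset{i:02d}" for i in range(3, n_datasets + 1)]
    return names


def due_datasets(call: int, frequencies: Sequence[int]) -> List[int]:
    """Indices of the data sets evaluated at a given call (counted from 1)."""
    # The first data set is the training set: its frequency is forced to 1.
    effective = [1] + list(frequencies[1:])
    return [i for i, freq in enumerate(effective) if call % freq == 0]


losses = per_dataset("sse", 3)         # same loss for three data sets
batch_sizes = per_dataset([None, 10], 2)  # individual batch sizes
names = default_names(4)
```

Here due_datasets(3, [5, 3]) selects both data sets, because the requested frequency of 5 for the training set is overridden, while due_datasets(4, [5, 3]) selects only the training set.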
- optimize() → scm.params.optimizers.base.MinimizeResult
- Start the optimization given the initial parameters.
- summary(file=None)
- Prints a summary of the current instance.
- __str__()
- Return str(self).
- delete()
- Remove the working directory from disk.