2.1. Introduction to parametrization

AMS provides many semi-empirical and empirical models that approximate the energies and other properties of a system, for example ReaxFF, UFF, DFTB, and GFN-xTB. These models are much faster than more accurate levels of theory such as DFT, but they rely on parameters to get their approximations right.

For many of these models, parameter values that accurately approximate the energy of one system will produce a poor approximation for another. Manually changing the parameters to improve the approximation is often laborious and difficult.

This is because parameters can be very abstract or have no physical interpretation. Sometimes, as in the case of ReaxFF, the number of parameters becomes so large that tuning them individually by hand is impossible. Parametrization is the process of tuning a model’s parameters so that it is better able to predict the properties of the system under study. ParAMS is a tool designed to help you do this easily and methodically.

This tutorial will guide you through the theory of parametrization and introduce various ParAMS concepts. It also provides some general rules of thumb and practical guidance for approaching your own parametrizations.

If you are looking to dive right in, feel free to skip ahead to Getting Started: Lennard-Jones, which provides a good introduction to the ParAMS software, or to ReaxFF (advanced): ZnS, which is a realistic parametrization of ReaxFF.

2.1.1. Training data

The first thing needed for a parametrization is training data: measurements of the properties you would like the model to approximate as accurately as possible.

Where does training data come from?

Training data can come from experiment or literature. Most commonly, however, it comes from calculations you run yourself at a higher level of theory such as DFT. From these calculations (or jobs) you then extract the information that you would like the model to get right.

What items can I include in my training set?

Training data can be any combination of physical properties: energies, forces, bond distances, bond angles, dihedral angles, charges, stresses, frequencies, etc. In fact, ParAMS allows you to include in the training set any value that can be extracted from an AMSJob. For a full list of available properties, and how to set up your own, see Extractors and Comparators.

What types of jobs should I run?

The type of calculation you run depends on the data you would like to extract. The three most important types are:

  1. Single point: A single point calculation preserves the input geometry and calculates the properties of the fixed configuration of atoms. Use this to extract things like energies and forces, i.e., properties that are a function of a given geometry.

  2. Geometry optimization: This type of calculation uses the model (and a trial set of parameters) to calculate an energy landscape and move the atoms along this landscape until they reach an energy minimum. From this type of calculation you would extract things like bond lengths and bond angles. In some ways, this is the inverse of a single point calculation: you use these jobs to extract geometry information given an energy landscape.

  3. PES Scan: This is a series of jobs which freeze one or more degrees of freedom and relax the others. For each job in the series, the frozen coordinates are slightly adjusted. In other words, it is a series of geometry optimizations where the coordinates of two or more atoms are frozen such that you can scan along a bond stretch, angle bend, or volume change. From these scans one would typically extract energies, although geometry data from non-frozen atoms may also be of interest. For details see Linear Transit, PES scan.
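The three job types can be illustrated on a toy system. The sketch below is not ParAMS or AMS code: it assumes a hypothetical two-atom system described by a Lennard-Jones potential, and all function names are invented for illustration. A single point returns the energy and force at a fixed distance, a geometry optimization follows the force to the energy minimum, and a PES scan evaluates the energy over a series of frozen distances.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy for two atoms separated by distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_force(r, epsilon=1.0, sigma=1.0):
    """Force along the bond, -dE/dr (positive values push the atoms apart)."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


def single_point(r, **params):
    """Fixed geometry in, energy and force out."""
    return lj_energy(r, **params), lj_force(r, **params)


def geometry_optimization(r_start, step=0.01, tol=1e-8, max_iter=10000, **params):
    """Move along the force until it (nearly) vanishes; return the relaxed distance."""
    r = r_start
    for _ in range(max_iter):
        force = lj_force(r, **params)
        if abs(force) < tol:
            break
        r += step * force  # simple steepest-descent update
    return r


def pes_scan(r_min, r_max, n_points, **params):
    """Energy at a series of frozen distances.

    A real scan would relax all other degrees of freedom at each step;
    a two-atom system has none, so each point is a single evaluation.
    """
    spacing = (r_max - r_min) / (n_points - 1)
    distances = [r_min + i * spacing for i in range(n_points)]
    return [(r, lj_energy(r, **params)) for r in distances]


energy, force = single_point(1.5)      # properties of a given geometry
r_relaxed = geometry_optimization(1.5)  # geometry of the energy minimum
scan = pes_scan(0.95, 2.0, 8)           # energy profile along the bond stretch
```

Each job type yields a different kind of training data from the same model: `single_point` gives energies and forces, `geometry_optimization` gives a bond length, and `pes_scan` gives an energy profile.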

A good training set will involve all calculation types to ensure that the model is able to roughly mimic all types of properties.

What should I consider when designing my training set?

Unfortunately, training set design can be something of a black art. Below are some broad guidelines to consider when designing yours:

  1. Improving the prediction of one property will likely come at the expense of another. The parametrized models used to replicate the training set made sacrifices of accuracy for the sake of speed. The consequence is that most models will not have the flexibility to accurately predict all properties, and you should focus on including items in the training set which model the properties most important to you for a given application. This is not to say that overall improvements in all properties are not possible. Indeed, you will often see such an improvement, particularly when starting from a poor parameter set. However, at a certain threshold this will stop, and if the prediction quality is still not good enough, you will likely only be able to improve one set of predictions at the cost of another.

  2. Guard against overfitting. Closely related to the above point is the issue of overfitting. If pushed too far, a model can predict one property almost perfectly and fail completely at others. Thus, it is important that the training set contains all sorts of data. This problem is particularly prevalent when the potential energy surface of a system has not been adequately sampled. For example, a training set containing only optimized geometries will likely produce a parametrized model that accurately predicts the energy minima, but is utterly unable to correctly predict bond lengths or other properties because it has no information about what distorted geometries look like.

  3. Use a validation set. A powerful way to check for overfitting is to use a validation set. A validation set contains data similar to, but distinct from, those in the training set. The items in the validation set are not actively used to help find better parameters; they serve only as a check to measure a model’s quality. Validation sets typically contain larger structures than the training set. The idea is to use information from simple molecules in a training set to predict the properties of more complex ones. When changing the parameters, if the quality of the model is improving for both sets, then it is likely you have found a reasonable model. If the quality of the training set predictions is improving, but the validation set predictions are getting worse, then you have likely started to overfit your data.

  4. Parametrization is an iterative process. Although the process of parametrization will definitely produce a model which matches the training set better, in practice, it is rare that a single parametrization will produce a model of sufficient quality for production runs. This is because there is almost always some unexpected or unphysical behavior caused by blind spots in the training set. For example, an interaction between two types of atoms may not have been included. Even where such interactions have been included, they can be masked by other items in the set. This means the items need to be reweighted or new ones introduced, and the process repeated.
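The validation-set check from guideline 3 can be sketched in a few lines. This is a minimal illustration, not how ParAMS implements it: the function name, the `patience` argument, and the loss values are all hypothetical. It reports the step with the best validation loss once the validation loss has failed to improve for several steps while the training loss has kept dropping.

```python
def detect_overfitting(train_losses, valid_losses, patience=3):
    """Return the index of the best validation loss if overfitting is suspected.

    Overfitting is flagged when the validation loss has not improved for
    `patience` consecutive steps while the training loss is still lower than
    it was at the best validation step. Returns None otherwise.
    """
    best_idx = 0
    for i in range(1, len(valid_losses)):
        if valid_losses[i] < valid_losses[best_idx]:
            best_idx = i  # validation is still improving
        elif i - best_idx >= patience and train_losses[i] < train_losses[best_idx]:
            return best_idx  # training improves, validation does not
    return None


# Made-up loss histories, one value per optimization step.
train = [10.0, 8.0, 6.0, 5.0, 4.0, 3.5, 3.0, 2.8]
valid = [12.0, 9.0, 7.0, 6.5, 6.8, 7.2, 7.9, 8.5]

step = detect_overfitting(train, valid)
if step is not None:
    print(f"Validation loss was best at step {step}; later steps look overfitted.")
```

With these illustrative numbers the training loss falls throughout, but the validation loss turns upward after step 3, so the parameters from that step would be the ones to keep.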

2.1.2. Loss function

To get an overall measure of how well a model reproduces the items in the training set, we introduce a loss function. Loss functions take different forms, such as the sum of squared errors (SSE), root-mean-square error (RMSE), or mean absolute error (MAE), but they all reduce the deviations between the training set and the values predicted by the model (with a given set of parameters) to a single value. The lower the loss value, the better the model is doing at reproducing the training data.
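As a minimal sketch of how such loss functions collapse many deviations into one number, the functions below compute the unweighted SSE, RMSE, and MAE for a list of reference values and model predictions. The function names and data are illustrative only, and the weights and acceptable errors that ParAMS applies to individual entries are deliberately left out.

```python
import math


def _check(reference, predicted):
    """Both sequences must be non-empty and of equal length."""
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("reference and predicted must be non-empty and equally long")


def sse(reference, predicted):
    """Sum of squared errors."""
    _check(reference, predicted)
    return sum((p - r) ** 2 for r, p in zip(reference, predicted))


def rmse(reference, predicted):
    """Root-mean-square error."""
    return math.sqrt(sse(reference, predicted) / len(reference))


def mae(reference, predicted):
    """Mean absolute error."""
    _check(reference, predicted)
    return sum(abs(p - r) for r, p in zip(reference, predicted)) / len(reference)


# Made-up reference values and model predictions (arbitrary units).
reference = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.0]

print("SSE :", sse(reference, predicted))
print("RMSE:", rmse(reference, predicted))
print("MAE :", mae(reference, predicted))
```

Because the SSE and RMSE square each deviation, they penalize a single large error more heavily than the MAE does, which is one reason the choice of loss function can change which parameter set looks best.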

The parametrization task is to adjust the parameters until we find a model which minimizes this loss value. This is done through a global optimization using an optimizer. Those interested in how all the types of training data are reduced to a single loss value can consult Sigma vs. weight: What is the difference? and Extractors and Comparators.

How do I balance my training set?

One issue that often arises when building a training set is how to emphasize one item over another. For example, as mentioned above, it is vital that both energy minima and distorted geometries are included in the training set, but it is more important to get the former right. For this, ParAMS provides two ways to balance the training set: acceptable errors (or sigmas) and weights. For more information, see Sigma vs. weight: What is the difference?
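A toy calculation shows why balance matters, using the C-H versus C-O case discussed in this section. It assumes a loss that is a plain weighted sum of squared errors, which is a simplification and not the ParAMS implementation. The entry counts, error sizes, and weights are invented: twenty C-H entries with a small error each and a single C-O entry with a larger error.

```python
def weighted_contributions(entries):
    """Sum weight * error**2 per label.

    entries: iterable of (label, weight, error) tuples.
    Returns a dict mapping each label to its total contribution to the loss.
    """
    totals = {}
    for label, weight, error in entries:
        totals[label] = totals.get(label, 0.0) + weight * error ** 2
    return totals


def share(totals, label):
    """Fraction of the total loss that comes from one label."""
    return totals[label] / sum(totals.values())


# Hypothetical training set: many C-H entries, one C-O entry, equal weights.
equal_weights = [("C-H", 0.01, 0.5)] * 20 + [("C-O", 0.01, 1.0)]
totals_equal = weighted_contributions(equal_weights)

# Same errors, but the single C-O entry is given a larger weight.
rebalanced = [("C-H", 0.01, 0.5)] * 20 + [("C-O", 0.2, 1.0)]
totals_rebalanced = weighted_contributions(rebalanced)

print("C-O share, equal weights:", share(totals_equal, "C-O"))
print("C-O share, rebalanced   :", share(totals_rebalanced, "C-O"))
```

With equal weights the C-H entries dominate the loss purely through their number, even though each individual C-H error is smaller than the C-O error. Raising the C-O weight reverses this.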

Example: Let’s assume you have a medium-sized organic molecule and you assign a weight of 0.01 to all C-H and all C-O bonds. This scheme will probably not result in a very high accuracy of the C-O bond energies. Why? Simply because there will be many more C-H bonds in the system than C-O bonds. Since all errors are summed, even small changes in the C-H bond energies will affect the value of the objective function more than a medium change in the C-O bond. The optimizer “sees” only a single number, so it is important to make sure that the objective function is balanced (hence the use of the term weight) or, if it is biased, that it is biased towards the entries you consider important for your system.

How do I minimize the loss function?

As mentioned above, the minimization is done through a global optimization procedure. ParAMS provides several optimizers, such as CMA-ES and those implemented through SciPy; however, you are free to use whatever strategy you like.

Be aware that the optimization itself is also an iterative procedure. Global optimizers typically include a level of stochasticity, which means that repeated optimizations from the same starting point will often lead to different parameter sets. This is why we recommend repeated optimizations to, at the very least, verify that your first optimization did not get trapped in a very poor local minimum.

Sometimes the different sets will even have very similar loss function values, implying that they are equally good at describing your system. This may be another indication that the training set is deficient in some way. Comparing the performance of such sets using MD simulations, for example, can help identify the differences between them and guide improvements in the training set.
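The effect of stochasticity can be seen in a small experiment. In the sketch below, a simple accept-if-better random search stands in for a real global optimizer such as CMA-ES, and the Rastrigin function stands in for a loss function with many local minima. Neither is taken from ParAMS, and all names are hypothetical. Every run starts from the same point and differs only in its random seed.

```python
import math
import random


def toy_loss(params):
    """Rastrigin function: many local minima, global minimum of 0 at the origin."""
    return sum(p * p - 10.0 * math.cos(2.0 * math.pi * p) + 10.0 for p in params)


def stochastic_search(loss, start, seed, n_steps=2000, step_size=0.3):
    """Accept-if-better random search; a stand-in for a real global optimizer."""
    rng = random.Random(seed)  # each seed gives a different random trajectory
    best = list(start)
    best_loss = loss(best)
    for _ in range(n_steps):
        trial = [p + rng.gauss(0.0, step_size) for p in best]
        trial_loss = loss(trial)
        if trial_loss < best_loss:
            best, best_loss = trial, trial_loss
    return best, best_loss


def repeated_optimizations(loss, start, seeds):
    """Run the search once per seed and return the results sorted by loss."""
    runs = [stochastic_search(loss, start, seed) for seed in seeds]
    return sorted(runs, key=lambda run: run[1])


start = [3.3, -2.7]
runs = repeated_optimizations(toy_loss, start, seeds=range(5))
for params, value in runs:
    print([round(p, 3) for p in params], round(value, 4))
```

Comparing the printed parameter sets and loss values across seeds is the toy equivalent of checking that a first optimization did not stop in a poor local minimum.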

2.1.3. Model parameters

Returning to the model we are trying to improve, it is important to carefully consider which parameters you are interested in changing. This is often not obvious, particularly for a heavily parametrized model like ReaxFF which can have hundreds of parameters in a single force field. Changing all the parameters simultaneously will almost invariably fail. A global optimization in such an enormous space is virtually guaranteed to settle in a local minimum, and be terribly overfitted due to the presence of too many degrees of freedom.

As a consequence, it is best to once again develop iteratively, incrementally turning on parameters as the training set is adjusted.

In the case of ReaxFF a good rule-of-thumb is:

  1. Start from an existing published force field that already models similar chemistry successfully;

  2. Optimize the charges. These are calculated by the EEM model (or optionally ACKS2) and have associated parameters in the force field;

  3. Optimize the bond parameters;

  4. Finally, optimize the angle parameters;

  5. Avoid changing atom and general parameters unnecessarily.
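The staged approach above can be sketched as a loop that activates one group of parameters at a time. Everything here is hypothetical: the parameter names, the quadratic toy loss, and the simple random search are placeholders, not ReaxFF parameters or ParAMS calls. The sketch also assumes that groups stay active once turned on, which is one possible reading of "incrementally".

```python
import random


def optimize_active(loss, params, active, seed=0, n_steps=500, step_size=0.1):
    """Accept-if-better random search that only perturbs parameters in `active`."""
    rng = random.Random(seed)
    best = dict(params)
    best_loss = loss(best)
    names = sorted(active)
    for _ in range(n_steps):
        trial = dict(best)
        name = rng.choice(names)  # inactive parameters are never touched
        trial[name] += rng.gauss(0.0, step_size)
        trial_loss = loss(trial)
        if trial_loss < best_loss:
            best, best_loss = trial, trial_loss
    return best, best_loss


def staged_optimization(loss, params, stages):
    """Activate parameter groups one stage at a time (cumulatively) and optimize."""
    active = set()
    current = dict(params)
    history = []
    for stage_name, names in stages:
        active.update(names)
        current, value = optimize_active(loss, current, active)
        history.append((stage_name, value))
    return current, history


# Hypothetical parameters and a toy quadratic loss around made-up target values.
initial = {"charge_a": 0.0, "bond_a": 0.0, "angle_a": 0.0, "general_a": 1.0}
target = {"charge_a": 0.5, "bond_a": -0.3, "angle_a": 0.8, "general_a": 1.0}


def toy_loss(params):
    return sum((params[k] - target[k]) ** 2 for k in target)


stages = [("charges", ["charge_a"]), ("bonds", ["bond_a"]), ("angles", ["angle_a"])]
final, history = staged_optimization(toy_loss, initial, stages)
for stage_name, value in history:
    print(stage_name, round(value, 6))
```

The `general_a` entry is never listed in a stage, so it keeps its starting value, mirroring the advice to leave atom and general parameters alone unless there is a reason to change them.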

For more information about the ReaxFF parameter names and meanings, see Force field format specification or ReaxFF. For GFN1-xTB see GFN1-xTB.

2.1.4. Where to now?

Now that you have a broad overview of the parametrization process, you can proceed to learning about the ParAMS package itself. The subsequent tutorials are ordered by complexity. Earlier examples are introductory and focus primarily on the software’s basics. Later examples are more realistic and introduce some of ParAMS’s more advanced features.