# trainset.in¶

## Description of the trainset.in file¶

The trainset.in file contains the training set data and tells the program how to calculate the cost function $$F = \Sigma ((y_i - y^{ref}_i) / acc_i)^2$$ , which can be used to optimize the force field parameters. The trainset.in uses molecule identifiers, or keys, defined in the DESCRP field of the geo file (in BGF format), or in the models.in file, to compare force field derived geometries and energy differences to the reference values. The trainset.in has a free format as far as numbers concerned, although it does require that fields are space-separated. Besides, the “-“, “+” and “/” symbols have a special meaning in the trainset.in file and should not be used in identifiers. The trainset.in file is divided into 5 sections listed below. Each section begins with a start keyword and ends with the corresponding end keyword. The words in “CELL PARAMETERS” and “ENDCELL PARAMETERS” must be separated by exactly one space.

## Sections format¶

 Block name Start keyword End keyword Format charges CHARGE ENDCHARGE Key Acc Atom Ref geometries GEOMETRY ENDGEOMETRY Key Acc [Atom1 [Atom2 [Atom3 [Atom4]]] Ref] forces FORCES ENDFORCES Key Acc Atom Ref cell parameters CELL PARAMETERS ENDCELL PARAMETERS Key Acc Type Ref energy differences ENERGY ENDENERGY Acc [+-] Key1/n1 … [+-] Key5/n5 Ref heat of formation HEATFO ENDHEATFO Key Acc Ref

Key” is the molecule name from the geo file. “Atom” is an atom index in the corresponding molecule. “Acc” is a value of the target accuracy desired for the given error function contribution. This value is often called “weight” although in practice it is 1/weight. “Ref” is the reference value.

## Format description¶

In the all sections except “ENERGY” each data line starts with the structure identifier (the Key), followed by the “Acc” of the data point. This is followed by a type identifier. Each section contains following data entries:

• CHARGE

In the CHARGE section the type identifier is the number of the atom in the molecule and the reference value is its charge. Example:

CHARGE
#Key     Acc Atom  Ref
chexane 0.1    1   -0.15
ENDCHARGE

• GEOMETRY

In the GEOMETRY section the type ID is the list of atoms defining an internal coordinate (two for an interatomic distance, three for a valence and four for a torsion angle). When there is only one atom index specified, then the Eucledian distance for the given atom between the two geometries is calculated. When the index is -1 then an average Eucledian distance quantity between the two geometries is used instead. Please note that any reference value different from zero for the Eucledian distances does not make much sense. Besides, since these distances are computed in the Cartesian coordinates, which means that a simple translation of the molecule as a result of energy minimization may result in large Eucledian distances for otherwise similar geometries. If there is no identifier provided then it means that the ReaxFF RMS force will be compared with the reference (which should probably be zero in most cases). Example:

GEOMETRY
#Key      Acc At1 At2 At3 At4  Ref
chexane  0.01   1              0.0    # Eucledian distance between atom in the reference and the trial structure
chexane  0.01  -1              0.0    # Average Eucledian distance between atoms in the two structures
chexane  0.01   1   2          1.5    # Interatomic distance
chexane  1.00   1   2   3      120.0  # Valence angle
chexane  1.00   1   2   3   4  180.0  # Torsion angle
chexane  1.00                  0.0    # RMS force
ENDGEOMETRY

• CELL PARAMETERS

In the CELL PARAMETERS section the type IDs are names of the corresponding lattice parameters. Example:

CELL PARAMETERS
#Key         Acc   Type     Ref
chex_cryst  0.01    a      11.20
chex_cryst  0.01    b      11.20
chex_cryst  0.01    c      11.20
chex_cryst  0.01    alpha  90.00
chex_cryst  0.01    beta   90.00
chex_cryst  0.01    gamma  90.00
ENDCELL PARAMETERS

• HEATFO

The HEATFO section does not require a type ID as compares the ReaxFF heat of formation with the reference value. Example:

HEATFO
#Key      Acc   Ref
methane  2.00 -17.80
ENDHEATFO

• ENERGY

This section allows comparison of ReaxFF energy differences between structures to the reference data. In this case, each data line starts with the Acc of the data point, followed by up to five operator/identifier/divider parts and finishes with the reference value. The operator is either ‘+’ or ‘-‘ (‘+’ is the default). The energy associated with the identifier is divided by the divider, allowing comparison of condensed structures to monomers. The ‘/’ character in the ENERGY section data lines is optional. Example:

ENERGY
#Acc op1  Key1  n1 op2  Key2   n2     DeltaE
1.5   + butbenz/1  -  butbenz_a/1     -90.00
1.5   + butbenz/1  -  butbenz_b/1     -71.00
1.5   + butbenz/1  -  butbenz_c/1     -78.00
ENDENERGY