# 3.3. Data Set¶

Following the Introduction, the DataSet class represents a set of physicochemical properties that the user would like to see reproduced by a parametric model. We define an arbitrary property $$P$$ as a function of a computational Job. A single entry in the DataSet can be expressed as a linear combination of multiple properties:

(3.1)$y = c_1 P_1(J_1) + c_2 P_2(J_1) + ... + c_N P_N(J_O)$

For every property $$P$$ there exists an extractor which is linked to the Data Set. Extractors define how to access the desired property from a calculated job and allow the user to easily extend the scope of what can be fitted during the parameterization.

Following the above equation, the DataSet has two purposes:

• Storage of all reference values $$\boldsymbol{y}$$
• Consecutive comparison of the values predicted by the parametric model $$\boldsymbol{\hat{y}}$$ to the reference with a specific Loss function

Just like any of the collections, the DataSet can be stored in a text format, which looks as following:

---
Expression: forces("H2O_1")
Weight: 100.0
Unit: kj/(mol*angstrom), 1.1858e3
ReferenceValue: |
array([ -0.,          -0.,       1.75e-5,
-0.,          -3.45e-5, -1.75e-5,
-0.,           3.45e-5,  0.       ])
---
Expression: angles("H2O_2",1,2,3)
Weight: 0.333
Sigma: 0.5
---
Expression: energy("H2O_3")
Weight: 1.0
ReferenceValue: -3.44
...


At runtime, the Data Set behaves like a list. Each element in it is an instance of DataSetEntry, with the following attributes:

• expression
• weight
• unit
• reference
• sigma
• jobids (read-only)
• extractors (read-only)

Addition of entries follows Eq (3.1), a weight has to be provided alongside each entry, marking the relative ‘importance’ of this entry in the total Data Set (see loss functions to see how the weight influences the total loss):

>>> ds = DataSet()
>>> # Adding without a reference value or unit:
>>> # Adding with a reference value:
>>> ds.add_entry("energy('Methane01') - 4*energy('Hydrogen')", 15.0, reference=0.033)


The default property units of ParAMS are atomic units (au). If the user wishes to fit properties in any other units, an additional unit argument has to be provided in form of a tupel:

>>> ds.add_entry("energy('Methane02')", 2.0, unit=('kcal/mol',6.2751e+02))


Here, the first item is the string name of the desired unit and the second item is a conversion factor from atomic units to the unit of choice.

Important

The default property units in ParAMS are atomic units. For any other units a conversion factor from au to the desired unit has to be provided.

>>> m = {'Origin' : 'A very important journal', 'Version' : 1.1}
>>> print(ds[-1])
---
Expression: forces('Methane02')
Weight: 1.0
Origin: 'A very important journal'
Version: 1.1


## 3.3.2. Accessing Entries¶

>>> ds["forces('Methane01')"] # If the expression is known.
>>> ds[0].expression          # Otherwise the class can be accessed as a list:
"forces('Methane01')"
>>> ds[0].weight
15.0
>>> ds[0].unit
('au', 1.)
>>> ds[0].reference
np.array([...])
>>> ds[0].jobids # The ID of the job linking to the AMSResults value. Needs to be the key in the dictionary passed to calculate_reference() and evaluate().
{"Methane01"}
>>> ds[0].extractors # Set of extractors that will be called when evaluating
{'forces'}


## 3.3.3. Removing Entries¶

>>> len(ds)
10
>>> entry = ds[3]                     # Single entry
>>> entry = ds["forces('Methane01')"] # Same, but with the expression known
>>> ds.index(entry)
3
>>> del ds[3]        # Delete by index
>>> ds.remove(entry) # Or delete by entry
>>> len(ds)
9


The del method can also delete multiple indices at once:

>>> len(ds)
9
>>> del_ids = [1,2,3]
>>> del ds[del_ids]
len(ds)
6


## 3.3.4. Adding External Reference Data¶

External reference data can be easily provided with the reference argument when calling add_entry(). If your reference data is not in atomic units, make sure to either convert it to au before adding, or specify a proper conversion factor from a.u. to your desired unit in the unit argument:

>>> forces = np.array([[-0, -0, 1.75e-5], [-0, -3.45e-5, -1.75e-5], [-0, 3.45e-5, 0]])
>>>
>>> ds = DataSet()
>>> ds.add_entry('energy("H2O_3")'     ,  1.  , reference=92.551, unit=('kcal/mol' 6.2751e2))


Unit has to be a list or tuple where the first element is a string with the target unit name and the second element is the conversion factor from au to target.

>>> ds['energy("H2O_3")'].reference = 12345


## 3.3.5. Calculating and Adding Reference Data with AMS¶

If the reference data is not yet available to the Data Set, the user can choose to extract it from an AMS calculation. By doing so, any AMS Engine can be used as the Reference Engine. Assuming a Data Set with a single entry

>>> ds = DataSet()
>>> print(ds)
---
Expression: angles("H2O_2",1,2,3)
Weight: 1
Sigma: 1.0
...


the user has multiple ways of calculating the reference values. The fastest would be through the use of JobCollection.run, if a Job Collection is available. Here, we make use of the fact that system geometry and job settings are already stored in the collection. The only additional input that is needed in this case is the selection of a reference engine:

>>> from scm.plams import Settings
>>> engine = Settings()
>>> engine.input.MOPAC # Our reference engine is MOPAC
>>> r = jc.run(engine, jobids=['H2O_2']) # r is a dictionary of {jobid : AMSResults}


Alternatively, if no Job Collection is yet created, the regular PLAMS syntax can be used:

>>> from scm.plams import *
>>> mol = Molecule('H2O_2.xyz')
>>> s = Settings()
>>> s.input.MOPAC
>>> init()
>>> job = AMSJob(name='H2O_2', molecule=mol, settings=s)
>>> r   = job.run()
>>> r   = {'H2O_2' : r} # The DataSet expects a dictionary


The above examples use the plams.Settings class. See the PLAMS documentation for more information about it.

We have now created and executed a reference job. Passing it to the Data Set can be done with calculate_reference():

>>> ds.calculate_reference(r)


The Data Set will automatically extract and store the correct results:

>>> print(ds) # Check our objective function
---
Expression: angles("H2O_2",1,2,3)
Weight: 1
ReferenceValue: 103.87
Sigma: 1.0
...


Warning

By default, when calculate_reference() is called, possibly present reference values will be overwritten with the ones extracted from plams.AMSResults. You can set overwrite=False if you want to override this behavior.

## 3.3.6. Storage and I/O¶

Once all reference values are computed, it might be useful to store them. The Data Set can be stored in the YAML format, or pickled if disk space and speed is more important than readability:

>>> ds.store('dataset.yml') # Same as print(ds) to file
>>> ds.pickle_dump('costfunction.pkl') # Binary


The function can then be restored with:

>>> ds = DataSet('costfunction.yml')


Alternatively, when an instance is already present:

>>> ds.load('dataset.yml') # or


## 3.3.7. Calculating the Loss Function Value¶

If each entry has a reference value, the computational results of any engine can be evaluated and compared by a suitable loss function. This might be useful when for a manual, quantitative comparison between different parameters and/or engines. The evaluation mostly follows the same signature as the calculation of reference data.

Note

The execution of jobs and evaluation of the Data Set is handled automatically during the Optimization. In most cases the user does not need to manually reproduce the workflow covered in this section.

>>> from scm.plams import Settings
>>> engine_settings = Settings()
>>> engine_settings.input.DFTB
>>> jobresults = jc.run(engine_settings) # Assuming we already have a JobCollection() in jc
>>> ds.evaluate(r, loss='rmse') # Assuming we already have a DataSet() in ds
1234.5


Important

The Data Set value can only be calculated (i.e. evaluate()) when each entry in it has a reference value.

## 3.3.8. Checking for Consistency with a given Job Collection¶

Data Set entries are tied to a JobCollection by a common jobID. The consistency of every DataSet instance can be checked with the DataSet.check_consistency() method. It will check if any DataSetEntry has jobids that are not included in a JobCollection instance and if so, return a list of indices with entries that can not be calculated given the Job Collection:

>>> # DataSet() in ds, JobCollection() in jc
>>> len(ds)
>>> 10
>>> bad_ids # ds entries with these indices could not be calculated given jc, meaning the property for the calculation of these entries requires a chemical system which is not present in jc
[1,10,13]
>>> len(ds)
7
>>> ds.check_consistency(jc) # No more issues
[]


The DataSet.check_consistency() method is equivalent to:

>>> bad_ids = [num for num,entry in enumerate(ds) if any(i not in jc for i in entry.jobids)]


Removing Entries

## 3.3.9. Splitting into Subsets¶

The Data Set can be further split into subsets. Each subset is a new instance of DataSet. If you are interesed in splitting it by size, see the methods:

Splitting can also be based on extractors:

>>> ds.forces() # Return a :class:DataSet instance with forces only.
>>> ds.angles() # Return a :class:DataSet instance with angles only. And so on.


These methods are created automatically based on the extractors available to the instance.

Limiting a Data Set to only specific jobIDs can be done with the from_jobids() method.

## 3.3.10. Data Set Entry API¶

class DataSetEntry(expression, weight, reference=None, unit=None, sigma=None, metadata=None)

A helper class representing a single entry in the DataSet.

Note

This class is not public in this module, as it is not possible to create these entries on their own. They can only be managed by an instance of a DataSet class.

weight : float or ndarray
The weight $$w$$, denoting the relative ‘importance’ of a single entry. See Loss Functions for more information on how this parameter affects the overall loss.
reference : Any
Reference value of the entry. Consecutive DataSet.evaluate() calls will compare to this value.
unit : Tuple[str,float]
The unit of this property
sigma : float
To calculate the loss function value, each entry’s residuals are calculated as $$(w/\sigma)(y-\hat{y})$$. Ideally, sigma represents the accepted ‘accuracy’ the user expects from a particular entry.
Default values differ depending on the property and are stored in individual extractor files. If the extractor file has no sigma defined, will default to 1 (see Extractors and Comparators for more details).
expression : str
The expression that will be evaluated during an DataSet.evaluate() call. A combination of extractors and jobids.
All Job IDs needed for the calculation of this entry
All extractors needed for the calculation of this entry

## 3.3.11. Data Set API¶

class DataSet(file=None, more_extractors=None)

A class representing the data set $$DS$$.

Attributes:

extractors_folder : str
Default extractors location.
extractors : set
Set of all extractors available to this instance.
jobids : set
Set of all jobIDs in this instance.
A string that will be printed at the beginning of the file when store() is called.
__init__(file=None, more_extractors=None)

Initialize a new DataSet instance.

Parameters:

file : str
Load a a previously saved DataSet from this path.
more_extractors : str or List[str]
Path to a user-defined extractors folder.
add_entry(expression, weight, reference=None, unit=None, sigma=None, metadata=None, dupe_check=True)

Adds a new entry to the cost function, given the expression, the desired weight and (optionally) the external reference value.
Skips checking the expression for duplicates if dupe_check is False. This is recommended for large cost functions >20k entries.

expression: str
The string representation of the extractor and jobid to be evaluated. See Adding Entries and Extractors and Comparators for more details.
weight: float, 1d array
Relatie weight of this entry. See Adding Entries for more details, see Loss Functions to check how weights influence the overall value.
reference: optional, Any
External reference values can be provided through this argument.
unit : optional, Tuple[str, float]
Whenever the reference needs to be stored in any other unit than the default (atomic units), use this argument to provide a tuple where the first element is the string name of the unit and the second is the conversion factor from atomic units to the desired unit.
See Adding External Reference Data for more details.
sigma : optional, float
To calculate the loss function value, each entry’s residuals are calculated as $$(w/\sigma)(y-\hat{y})$$. Ideally, sigma represents the accepted ‘accuracy’ the user expects from a particular entry.
Default values differ depending on the property and are stored in individual extractor files. If the extractor file has no sigma defined, will default to 1 (see Extractors and Comparators for more details).
Optional metadata, will be printed when store() is called, can be accessed through each entry’s metadata argument.
dupe_check: True
If True, will check that every expression is unique per Data Set instance. Disable if working with large data sets.
Rather than adding an entry multiple times, consider increasing its weight.
calculate_reference(results, overwrite=True)

Calculate the reference values for every entry, based on results.

results : dict

A {jobid : AMSResults} dictionary:

>>> job = AMSJob( ... )
>>> job.run()
>>> DataSet.calculate_reference({job.name:job.results})


Can be ‘sparse’ (or empty == {}), as long as the respective entry (or all) alrady has a reference value defined.

overwrite : bool
By default, when this method is called, possibly present reference values will be overwritten with the ones extracted from the results dictionary. Set this to False if you want to override this behavior.
evaluate(results, loss: Type[scm.params.core.lossfunctions.Loss] = 'sse', return_residuals=False)

Compares the optimized results with the respective reference. Returns a single cost function value based on the loss function.

results : dict
Same as calculate_reference().
loss : Loss, str
A subclass of Loss, holding the mathematical definition of the loss function to be applied to every entry, or a registered string shortcut.
return_residuals : bool
Whether to return residuals and contributions in addition to fx.
fx : float
The overall cost function value after the evaluation of results.
residuals : List[1d-array]
List of unweighted per-entry residuals (or the return values of compare(y,yhat) in the case of custom comparators)
contributions : List[float]
List of relative per-entry contributions to fx
get(key: str)

Return a list of per-entry attributes, where key determines the attribute, equivalent to:
[getattr(i,key) for i in self].

set(key: str, values: Sequence)

Batch-set the key attribute of every entry to a value from values. values must be the same length als the number of entries in the DataSet instance.
Equivalent to [setattr(e,key,v) for e,v in zip(self,values)].

load(yamlfile='dataset.yaml')

Loads a DataSet from a YAML file.

store(yamlfile='dataset.yaml')

Stores the DataSet to a YAML file.

pickle_load(fname)

Loads a DataSet from a pickled file.

pickle_dump(fname)

Stores the DataSet to a pickled file.

copy()

Return a deep copy of the instance.

__contains__(key)

Key should be a full expression (for jobids only, iterate over jobids)!

__getitem__(idx)

Dict-like behavior if idx is a string, list-like if an int

__delitem__(idx: Union[int, Sequence[int]])

Delete one entry at index idx. Can either be an index (int) or an expression (str). In both cases a sequence can be passed to delete multiple elements at once

remove(entry)

Remove the DataSetEntry instance from this Data Set

index(entry)

Return the index of a DataSetEntry instance in this Data Set

__len__()

Return the length of this Data Set

__iter__()

Iterare over all DataSetEntries in this Data Set

__call__(key)

Same as __getitem__().

__eq__(other)

Check if two Data Sets are equal.

__ne__(other)

Return self!=value.

__add__(other)

Add two Data Sets, returning a new instance. Does not check for duplicates.

__sub__(other)

Substract two Data Sets, returning a new instance. Does not check for duplicates.

__str__()

Return a string representation of this instance.

jobids

Return a set of all jobIDs necessary for the evaluation of this instance.

check_consistency(jc: scm.params.core.jobcollection.JobCollection) → Sequence

Checks if this instance is consistent with jc, i.e. if all entries can be calculated from the given JobCollection.

jc : JobCollection
The Job Collection to check against

List of entry indices that contain jobIDs not present in jc.
Use del self[i] to delete the entry at index i.

maxjobs(njobs, seed=None, _patience=100, _warn=True)

Returns a random subset of self, reduced to len(DataSet.jobids) <= njobs AMS jobs.

Note

Can result in a subset with a lower (or zero) number of jobs than specified. This happens when the data set consists of entries that are computed from multiple jobs and adding any single entry would result in a data set with more jobs than requested.
Will warn if a smaller subset is generated and raise a ValueError if the return value would be an empty data set.

split(*percentages, seed=None)

Returns N=len(percentages) random subsets of self, where every percentage is the relative number of entries per new instance returned.

>>> len(self) == 10
>>> a,b,c = self.split(.5, .2, .3) # len(a) == 5 , len(b) == 2, len(c) == 3, no overlap

random(N, seed=None)

Returns a new subset of self with len(subset)==N.

from_jobids(jobids: Set[str])

Generate a subset only containing the provided jobids

print_contributions(contribs, fname, sort=True)

Print contribs to fname, assuming the former has the same ordering as the return value of get().

contribs : List
Return value of evaluate()
fname : str
File location for printing
sort : bool
Sort the contributions from max to min. Original order if disabled.
print_residuals(residuals, fname, weigh=True, extractors: List[str] = None)

Print residuals to fname, assuming the former has the same ordering as the return value of get(). Entries can be limited to certain extractors with the respective keyword argument.

residuals : List
Return value of evaluate()
fname : str
File location for printing
weigh : bool
Whether or not to apply the weights when printing the residuals
extractors : optional, List[str]
If provided, will limit the extractors to only the ones in the list. Defaults to all extractors.
get_predictions(residuals: List, fpath: str = None, extractors: List = None) → List

Return the absolute predicted values for each data set entry based on the residuals vector as returned by evaluate(), optionally print reference and predicted values for each entry to file.

residuals : List
Return value of evaluate()
fpath : optional, str
If provided, will print EXPRESSION, SIGMA, WEIGHT, REFERENCE, PREDICTION to file.
extractors: optional, Sequence of strings
A list of extractors to be included in the output. Includes all by default.
preds : List of 2-Tuples
Returns a list where each element is a tuple of (expression, predicted value).