7.10.8.1. Working with Data Sets
The following features mainly work with the Data Set.
7.10.8.1.1. Data Set Sensitivity
Data Sets that require a large number of jobs for their evaluation will usually be the bottleneck of a parameter optimization. The SubsetScan class makes it possible to estimate the diversity of a set prior to the fitting process. This is done by evaluating multiple smaller, randomly drawn subsets of the original set and reporting their loss function values. These values can then be compared to the full data set’s loss.
One example where this can be useful is when data sets are somewhat homogeneous. In such cases it can pay off to search for a smaller subset before training, thus reducing the optimization time. A smaller subset is a compromise between its size and the error in the loss function value relative to the original set. The SubsetScan class can be used as an aid in such cases.
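The idea of drawing random subsets and comparing their loss to the full set can be sketched without ParAMS. The snippet below is only an illustration under simplifying assumptions: it works on a synthetic list of residuals rather than on jobs, and the names `rmse` and `toy_subset_scan` are hypothetical helpers, not part of the ParAMS API.

```python
import math
import random

def rmse(residuals):
    """Root-mean-square error of a sequence of residuals."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

def toy_subset_scan(residuals, steps, reps_per_step, seed=0):
    """Return fx[i][j]: the RMSE of the j-th random subset of size steps[i]."""
    rng = random.Random(seed)
    return [
        [rmse(rng.sample(residuals, n)) for _ in range(reps_per_step)]
        for n in steps
    ]

# Synthetic, roughly homogeneous residuals (made up for illustration only).
data_rng = random.Random(42)
residuals = [data_rng.gauss(0.0, 1.0) for _ in range(5000)]

fx0 = rmse(residuals)          # loss of the full set
steps = [50, 500, 5000]        # subset sizes to try
fx = toy_subset_scan(residuals, steps, reps_per_step=5)

for n, row in zip(steps, fx):
    spread = max(abs(v / fx0 - 1.0) for v in row)
    print(f"{n:>5} entries: max relative deviation {spread:.3f}")
```

For a homogeneous set, the relative deviation shrinks quickly as the subset grows, which is the behavior the scan is meant to reveal.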
Assuming that a Data Set instance ds with reference values, a Job Collection jc that can be used to generate the results needed for the evaluation of the data set, and a parameter interface x are defined:
len(ds)
# 45600
len(ds.jobids)
# 45975
# Our data set is huge, let's see if it can be reduced without sacrificing much accuracy
# Initialize with DataSet, JobCollection and ParameterInterface
scan = SubsetScan(ds, jc, x, loss='rmse')
# This attribute stores the loss function value of the initial DataSet `ds`
fx0 = scan.fx0
# Decide on the number of jobs we would like to consider for a subset:
steps = [100, 500, 1000, 2500, 10000, 25000, 35000, 40000]
# At each step, evaluate n randomly created subsets:
reps_per_step = 20
# Now start the scan:
fx = scan.scan(steps, reps_per_step)
# The result is an array of shape (len(steps), reps_per_step)
assert fx.shape == (8,20)
# Let's visualize the results:
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size':20})
dim = fx.shape[1]
for i in range(dim):
    plt.plot(steps, fx[:,i]/fx0)
plt.ylabel('fx/fx0')
plt.xlabel('Number of jobs in subset')
plt.xscale('log')
plt.tight_layout()
Note
If a results dictionary from JobCollection.run has previously been calculated and is available, SubsetScan can also be instantiated without a job collection and parameter interface:
# Initialize with a results dictionary `results`
scan = SubsetScan(ds, resultsdict=results, loss='rmse')
The resulting figure could look similar to the following. In this case, it shows that reducing the set to a subset of 10000 jobs leads to a relative error of under 5% compared to the evaluation of the full data set.
Note
Note that this example was created on a data set with only one property and equal weights for each entry. Real applications might not result in such homogeneous behavior.
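Reading the smallest acceptable subset size off the scan result can also be done programmatically. The following is a minimal sketch, assuming `fx` is indexable as `fx[i][j]` (as the array returned by `scan` is) and that a subset size counts as acceptable when every repetition stays within a relative tolerance of `fx0`. The helper name `smallest_acceptable_step` and the numbers are hypothetical and chosen only for illustration.

```python
def smallest_acceptable_step(steps, fx, fx0, tol=0.05):
    """Return the smallest step whose repetitions all lie within tol of fx0.

    Returns None if no step satisfies the tolerance.
    """
    for n, row in sorted(zip(steps, fx), key=lambda pair: pair[0]):
        if all(abs(v / fx0 - 1.0) <= tol for v in row):
            return n
    return None

# Illustrative values only, not taken from a real scan.
steps = [100, 1000, 10000]
fx = [
    [0.80, 1.30],   # 100 jobs: large scatter
    [0.93, 1.04],   # 1000 jobs: moderate scatter
    [0.98, 1.01],   # 10000 jobs: close to the full set
]
fx0 = 1.0

print(smallest_acceptable_step(steps, fx, fx0, tol=0.05))
print(smallest_acceptable_step(steps, fx, fx0, tol=0.10))
```

Requiring all repetitions to pass is a conservative choice. A criterion based on the median or a quantile of each row would be a reasonable alternative.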
API

class SubsetScan(data_set: scm.params.core.dataset.DataSet, job_collection: Optional[scm.params.core.jobcollection.JobCollection] = None, par_interface=None, resultsdict: Optional[Dict] = None, workers: Optional[int] = None, use_pipe=True, loss='rmse')
This class helps in the process of identifying a Data Set's sensitivity to the total number of jobs by consecutively evaluating smaller, randomly drawn subsets. The resulting loss values can be compared to the one from the complete data set to determine homogeneity and help with size reduction or diversification of the set (see documentation for examples).
__init__(data_set: scm.params.core.dataset.DataSet, job_collection: Optional[scm.params.core.jobcollection.JobCollection] = None, par_interface=None, resultsdict: Optional[Dict] = None, workers: Optional[int] = None, use_pipe=True, loss='rmse')
Initialize a new search instance.
 Parameters
 data_set : DataSet
The original data set instance. Will be used for subset generation. Reference values have to be present.
 job_collection : JobCollection
Job Collection instance to be used for the results calculation.
 par_interface : BaseParameters
A derived parameter interface instance. The associated engine will be used for the results calculation.
 resultsdict : dict({'jobid' : AMSResults}), optional
Instead of providing a job collection and parameter interface, an already calculated results dictionary can be passed. In this case, the initial results calculation will be skipped. The dict should be an output of JobCollection.run().
 workers : int
When calculating the results, determines the number of jobs to run in parallel. Defaults to os.cpu_count()/2.
 use_pipe : bool
When calculating the results, determines whether to use the AMSWorker interface.
 loss : Loss, str
The loss function to be evaluated.
Important
Use caution with loss functions that do not average the error, such as the sum of squares error (sse). To ensure comparability, loss values must be invariant to the data set size.
The fx0 attribute will store the initial data set’s loss function value.
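Why a non-averaging loss breaks the comparison can be seen in a small stand-alone experiment. The sketch below uses synthetic residuals and two hand-written helpers, `sse` and `rmse`, which are illustrative stand-ins and not the ParAMS loss implementations.

```python
import math
import random

def sse(residuals):
    """Sum of squared errors: grows with the number of entries."""
    return sum(r * r for r in residuals)

def rmse(residuals):
    """Root-mean-square error: averaged, so independent of the set size."""
    return math.sqrt(sse(residuals) / len(residuals))

rng = random.Random(0)
# Synthetic residuals, made up for illustration only.
full = [rng.gauss(0.0, 1.0) for _ in range(10000)]
subset = rng.sample(full, 1000)   # a random 10% subset

sse_ratio = sse(subset) / sse(full)
rmse_ratio = rmse(subset) / rmse(full)

print(f"SSE  subset/full: {sse_ratio:.3f}")   # close to the size ratio 0.1
print(f"RMSE subset/full: {rmse_ratio:.3f}")  # close to 1
```

With the sum of squares, the ratio mostly reflects the subset size rather than how representative the subset is, so fx/fx0 is no longer a meaningful measure.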

scan(steps, reps_per_step=10)
Start the scan for data set subsets.
 Parameters
 steps : List or Tuple
A list of integers. Each entry represents the number of jobs that the original data set will be randomly reduced to and then evaluated.
 reps_per_step : int
Repeat every step n times, randomly drawing different entries to generate the subset.
 Returns
 fx : ndarray
A 2d array of loss function values with the shape (len(steps), reps_per_step).

makesteps_exp(exponent: float, start: int = 10) → numpy.ndarray
Generate a number of exponentially increasing subset sizes such that
steps = []
while start <= len(ds.jobids):
    steps.append(int(start))
    start **= exponent
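The loop above can be tried out as a stand-alone function. This is a sketch that mirrors the documented pseudocode, not the library method: the name `makesteps_exp_sketch` is hypothetical, the number of jobs is passed in explicitly as `n_jobs` instead of being read from the data set, a plain list is returned instead of an ndarray, and the input check is an addition to avoid an endless loop.

```python
def makesteps_exp_sketch(exponent, n_jobs, start=10):
    """Generate subset sizes by repeatedly raising `start` to `exponent`."""
    if exponent <= 1 or start <= 1:
        # Without growth the while loop below would never terminate.
        raise ValueError("exponent and start must both be greater than 1")
    steps = []
    value = float(start)
    while value <= n_jobs:
        steps.append(int(value))
        value **= exponent
    return steps

print(makesteps_exp_sketch(1.5, 1000))
print(makesteps_exp_sketch(2, 45975))
```

Because the value is raised to a power at each iteration, the sizes grow faster than geometrically, so only a few steps are produced even for large data sets.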

plotscan(steps, fx, filepath=None, ylim=None, xlogscale=True, boxwidths=None, backend=None)
Create a boxplot for the given steps and fx values.
 Parameters
 steps : ndarray
x values as returned by scan().
 fx : ndarray
y values as returned by scan().
 filepath : str
Path where the figure will be stored. If None, will plt.show() instead.
 ylim : Tuple[float, float]
Lower/upper y limits on the plot.
 xlogscale : bool
Apply logarithmic scaling to the x axis. Choose depending on the spacing of steps.
 boxwidths : float or sequence of floats
Use this setting to adjust the box width.
 backend : str
The matplotlib backend to use.

7.10.8.1.2. Normalization of Data Set Weights

normalize_weights(ds: scm.params.core.dataset.DataSet, resultsdict: Dict[str, scm.plams.interfaces.adfsuite.ams.AMSResults], extractors: Optional[List[str]] = None, set_best=True, loss='rmse', maxiter=1000, verbose=True)
Normalize a data set’s weights by minimizing the standard deviation of all individual contributions in that data set.
This is done through the optimization of a weights vector, where each weight is applied to all entries with the same extractor. For example, for a data set that contains forces and energies, the weights of all entries that use the former extractor will be optimized to one value, and the weights of all entries that use the latter to a different value.
Note
New weights are applied through multiplication with the initial values in order to preserve the shape, i.e., w_new = w_0 * x. Consider setting relevant initial weights to 1. if this is not the desired behavior.
Example
>>> jc = JobCollection()
>>> ... # populate the job collection
>>> ds = DataSet()
>>> ... # populate the data set and calculate the reference values, assuming we are adding energies, forces and charges
>>> results = jc.run(interface) # run all jobs needed for the evaluation of ds
>>> minres = normalize_weights(ds, results, extractors=['energy', 'forces', 'charges'])
>>> minres.x
array([ 2.37, 27.59, 1. ])
>>> ds.energy()[0].weight
2.37
>>> ds.forces()[0].weight
27.59
>>> ds.charges()[0].weight
1.
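The multiplicative update w_new = w_0 * x can be illustrated in plain Python. This is a sketch of the arithmetic only: the helper `apply_multipliers` and the (extractor, weight) tuples are hypothetical stand-ins for data set entries, and the multipliers are simply the values from the example output.

```python
def apply_multipliers(entries, multipliers):
    """Scale each entry's initial weight by the multiplier of its extractor.

    entries: list of (extractor_name, initial_weight) tuples
    multipliers: dict mapping extractor_name to the optimized factor x
    Extractors without a multiplier keep their initial weight.
    """
    return [(ext, w0 * multipliers.get(ext, 1.0)) for ext, w0 in entries]

# Hypothetical initial weights; the two energy entries differ by a factor of 2.
entries = [("energy", 1.0), ("energy", 2.0), ("forces", 1.0), ("charges", 0.5)]
multipliers = {"energy": 2.37, "forces": 27.59, "charges": 1.0}

new_entries = apply_multipliers(entries, multipliers)
for ext, w in new_entries:
    print(f"{ext:>8}: {w:.2f}")
```

The ratio between the two energy weights is unchanged after scaling, which is what preserving the shape refers to. If all entries of an extractor should end up with the same weight, their initial weights need to be equal beforehand.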
 Parameters
 ds : DataSet
Data Set instance to be evaluated.
 resultsdict : dict
A {name: AMSResults} dict with results that can be used to evaluate ds.
 extractors : List of strings
List of extractors that should be considered for the minimization. All of them should be present in the Data Set; an error will be raised otherwise. By default, all extractors in ds are used.
 set_best : bool
Whether to set the weights after the optimization or not.
 loss : str
The ParAMS loss to use.
 maxiter : int
Maximum number of function evaluations.
 verbose : bool
Whether to print initial and final losses (standard deviation of the contributions vector).
 Returns
minres : scipy.optimize.OptimizeResult