3.12.3. Data Set Sensitivity

Data Sets that require a large number of jobs for their evaluation are usually the bottleneck of a parameter optimization. The SubsetScan class makes it possible to estimate the diversity of a set prior to the fitting process. This is done by evaluating multiple smaller, randomly drawn subsets of the original set and reporting their loss function values. These values can then be compared to the full data set’s loss.

This is useful, for example, when a data set is fairly homogeneous. In such cases, searching for a smaller subset before training can reduce the optimization time. Choosing a subset is a compromise between its size and the error in its loss function value relative to the original set. The SubsetScan class can be used as an aid in such cases.
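The idea of comparing subset losses to the full-set loss can be sketched without ParAMS at all. The snippet below is a minimal illustration on synthetic residuals, not the SubsetScan implementation: the helper names `rmse` and `subset_losses` are hypothetical, and entries are drawn directly, whereas the class draws subsets by number of jobs.

```python
import math
import random

def rmse(residuals):
    """Root-mean-square error of a sequence of residuals."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

def subset_losses(residuals, size, reps, seed=0):
    """Loss of `reps` random subsets, each holding `size` entries."""
    rng = random.Random(seed)
    return [rmse(rng.sample(residuals, size)) for _ in range(reps)]

# Synthetic residuals (prediction minus reference), for illustration only
rng = random.Random(42)
residuals = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Loss of the full set, the reference every subset is compared to
fx0 = rmse(residuals)

# Spread of the subset loss relative to the full-set loss
for size in (50, 500, 2500):
    ratios = [fx / fx0 for fx in subset_losses(residuals, size, reps=20)]
    print(size, round(min(ratios), 3), round(max(ratios), 3))
```

For a homogeneous set, the spread of the ratios around 1 is expected to narrow as the subset size grows.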

Assuming that a Data Set instance ds with reference values, a Job Collection jc that can be used to generate the results needed for the evaluation of the data set, and a parameter interface x are defined:

len(ds)
# 45600
len(ds.jobids)
# 45975
# Our data set is huge, let's see if it can be reduced without sacrificing much accuracy

# Initialize with DataSet, JobCollection and ParameterInterface
scan = SubsetScan(ds, jc, x, loss='rmse')

# This attribute stores the loss function value of the initial DataSet `ds`
fx0 = scan.fx0

# Decide on the number of jobs we would like to consider for a subset:
steps = [100, 500, 1000, 2500, 10000, 25000, 35000, 40000]

# At each step, evaluate n randomly created subsets:
reps_per_step = 20

# Now start the scan:
fx = scan.scan(steps, reps_per_step)

# The result is an array of shape (len(steps), reps_per_step)
assert fx.shape == (8,20)

# Let's visualize the results:
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size':20})
dim = fx.shape[-1]
for i in range(dim):
    plt.plot(steps, fx[:, i] / fx0)
plt.ylabel('fx/fx0')
plt.xlabel('Number of jobs in subset')
plt.xscale('log')
plt.tight_layout()

Note

If a results dictionary from JobCollection.run() has previously been calculated and is available, SubsetScan can also be instantiated without a job collection and parameter interface:

# Initialize with a results dictionary `results`
scan = SubsetScan(ds, resultsdict=results, loss='rmse')

The resulting figure could look similar to the following:

../../_images/minjobsearch_results.png

In this case, the figure shows that reducing the set to a subset of 10000 jobs would lead to a relative error of under 5% compared to the evaluation of the full data set.
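Reading such a threshold off the plot can also be done numerically. The following is a minimal sketch, assuming a hypothetical helper `smallest_safe_step` (not part of the SubsetScan API) and made-up loss values; `fx` is given as nested lists, one row per step, which an array returned by scan() could be converted to, for example with `fx.tolist()`.

```python
def smallest_safe_step(steps, fx, fx0, tol=0.05):
    """Return the smallest subset size whose worst repetition deviates
    from the full-set loss by less than `tol` (relative), or None."""
    for n, row in sorted(zip(steps, fx)):
        worst = max(abs(value / fx0 - 1.0) for value in row)
        if worst < tol:
            return n
    return None

# Made-up numbers for illustration; real values come from scan()
steps = [100, 1000, 10000]
fx0 = 2.0
fx = [
    [1.2, 2.9, 2.4],     # 100 jobs: up to 45% off
    [1.8, 2.2, 2.1],     # 1000 jobs: up to 10% off
    [1.95, 2.04, 2.01],  # 10000 jobs: within 2.5%
]
print(smallest_safe_step(steps, fx, fx0))  # 10000
```

Using the worst repetition rather than the mean is a conservative choice, since a single unlucky draw is what an actual reduced data set would correspond to.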

Note

This example was created on a data set with only one property and equal weights for each entry. Real applications might not show such homogeneous behavior.


API

class SubsetScan(dataset: scm.params.core.dataset.DataSet, jobcollection: scm.params.core.jobcollection.JobCollection = None, par_interface=None, resultsdict: Dict = None, workers: int = None, use_pipe=True, loss='rmse')

This class helps to identify a Data Set's sensitivity to the total number of jobs by consecutively evaluating smaller, randomly drawn subsets. The resulting loss values can be compared to the one from the complete data set to determine homogeneity and help with size reduction or diversification of the set (see documentation for examples).

__init__(dataset: scm.params.core.dataset.DataSet, jobcollection: scm.params.core.jobcollection.JobCollection = None, par_interface=None, resultsdict: Dict = None, workers: int = None, use_pipe=True, loss='rmse')

Initialize a new search instance.

Parameters:
dataset : DataSet
The original data set instance. Will be used for subset generation. Reference values have to be present.
jobcollection : JobCollection
Job Collection instance to be used for the results calculation
par_interface : BaseParameters
A derived parameter interface instance. The associated engine will be used for the results calculation.
resultsdict : dict({'jobid' : AMSResults}), optional
Instead of providing a job collection and parameter interface, an already calculated results dictionary can be passed. In this case initial results calculation will be skipped. The dict should be an output of JobCollection.run().
workers : int
When calculating the results, determines the number of jobs to run in parallel. Defaults to os.cpu_count()/2.
use_pipe : bool
When calculating the results, determines whether to use the AMSWorker interface.
loss : Loss, str

The loss function to be evaluated.

Important

Use caution with loss functions that do not average the error, such as the sum of squares error (sse). To ensure comparability, loss values must be invariant to the data set size.

The fx0 attribute will store the initial data set’s loss function value.
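The requirement that loss values be invariant to the data set size can be illustrated with plain Python. This is a minimal sketch on made-up residuals; the functions `sse` and `rmse` below are local stand-ins written for this example, not the ParAMS loss classes.

```python
import math

def sse(residuals):
    """Sum of squared errors: grows with the number of entries."""
    return sum(r * r for r in residuals)

def rmse(residuals):
    """Root-mean-square error: averaged, so independent of the size."""
    return math.sqrt(sse(residuals) / len(residuals))

full = [0.5, -0.5] * 1000   # 2000 residuals, all of magnitude 0.5
subset = full[:200]         # 10% of the entries, same error per entry

print(sse(subset) / sse(full))    # 0.1
print(rmse(subset) / rmse(full))  # 1.0
```

Although the subset has exactly the same error per entry, its sse is only a tenth of the full-set value, so a ratio to fx0 would mostly reflect the subset size rather than its composition.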

scan(steps, reps_per_step=10)

Start the scan for data set subsets.

Parameters:
steps : List or Tuple
A list of integers. Each entry is the number of jobs to which the original data set is randomly reduced before being evaluated.
reps_per_step : int
Repeat every step n times, randomly drawing different entries to generate the subset.
Returns:
fx : ndarray
A 2d array of loss function values with the shape (len(steps), reps_per_step).

makesteps_exp(exponent: float, start: int = 10) → numpy.ndarray

Generate a number of exponentially increasing subset sizes such that:

steps = []
while start <= len(ds.jobids):
    steps.append(int(start))
    start **= exponent
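The loop above can be tried out as a standalone function. This is a minimal sketch, not the library method: the name `make_steps` and the explicit `n_jobs` argument (standing in for `len(ds.jobids)`) are assumptions, it returns a plain list rather than an ndarray, and the input check is an addition of this example to avoid a loop that never terminates.

```python
def make_steps(exponent, n_jobs, start=10):
    """Subset sizes obtained by repeatedly raising `start` to `exponent`
    until the value exceeds `n_jobs`."""
    if exponent <= 1 or start <= 1:
        # Not from the documented method: guards against an endless loop
        raise ValueError("exponent and start must both be greater than 1")
    steps = []
    value = float(start)
    while value <= n_jobs:
        steps.append(int(value))
        value **= exponent
    return steps

print(make_steps(2, 45975))    # [10, 100, 10000]
print(make_steps(1.5, 45975))  # [10, 31, 177, 2371]
```

Because each value is raised to the power of `exponent`, the sizes grow very quickly; exponents only slightly above 1 give a denser set of steps.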

plotscan(steps, fx, filepath=None, ylim=None, xlogscale=True, boxwidths=None, backend=None)

Create a boxplot for the given steps and fx values.

Parameters:
steps : ndarray
x values, as passed to scan()
fx : ndarray
y values as returned by scan()
filepath : str
Path where the figure will be stored. If None, plt.show() is called instead.
ylim : Tuple[float, float]
Lower/upper y limits on the plot
xlogscale : bool
Apply logarithmic scaling to the x-axis. Choose depending on the spacing of steps.
boxwidths : float or sequence of floats
Use this setting to adjust the box width
backend : str
The matplotlib backend to use