### Add an entry

In [1]:
from scm.params import *
import numpy as np

data_set = DataSet()
data_set.add_entry("angle('H2O', 0, 1, 2)", weight=0.333)

To **access the last added element**, use ``data_set[-1]``:

In [2]:
print("String representation of data_set[-1]")
print(data_set[-1])
print("Type: {}".format(type(data_set[-1])))

String representation of data_set[-1]
---
Expression: angle('H2O', 0, 1, 2)
Weight: 0.333
Unit: degree, 1.0

Type: 


You can also **change it after you've added it**:

In [3]:
data_set[-1].sigma = 3.0
print(data_set[-1])

---
Expression: angle('H2O', 0, 1, 2)
Weight: 0.333
Sigma: 3.0
Unit: degree, 1.0



We recommend always specifying the *reference value*, the *unit*, and the *sigma* value when adding an entry, along with any meaningful *metadata* about the entry.

In [4]:
data_set.add_entry(
    "energy('H2O')-0.5*energy('O2')-energy('H2')",
    weight=2.0,
    reference=-241.8,
    unit=("kJ/mol", 2625.15),
    sigma=10.0,
    metadata={
        "Source": "NIST Chemistry WebBook",
        "Description": "Hydrogen combustion (gasphase) per mol H2",
    },
)
print(data_set[-1])

---
Expression: energy('H2O')-0.5*energy('O2')-energy('H2')
Weight: 2.0
Sigma: 10.0
ReferenceValue: -241.8
Unit: kJ/mol, 2625.15
Description: Hydrogen combustion (gasphase) per mol H2
Source: NIST Chemistry WebBook
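
The second element of the `unit` tuple is a conversion factor. The sketch below only illustrates the arithmetic. It assumes (this is not stated above) that the factor converts from atomic units (hartree) to the named unit, so dividing a reference value by it gives the value in hartree. The variable names are made up for the example.

```python
# Assumption: the factor in `unit` maps atomic units (hartree) to the named unit.
HARTREE_TO_KJ_PER_MOL = 2625.15  # same factor as in the entry above

reference_kj_per_mol = -241.8
reference_hartree = reference_kj_per_mol / HARTREE_TO_KJ_PER_MOL

# Multiplying by the factor again recovers the value in kJ/mol.
round_trip = reference_hartree * HARTREE_TO_KJ_PER_MOL

print(f"{reference_kj_per_mol} kJ/mol is about {reference_hartree:.5f} Ha")
```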



All *expressions* in a single DataSet **must be unique**:

In [5]:
try:
    data_set.add_entry("energy('H2O')-0.5*energy('O2')-energy('H2')", weight=2.0)
except Exception as e:
    print("Caught the following exception: {}".format(e))

Caught the following exception: Expression `energy('H2O')-0.5*energy('O2')-energy('H2')` already in DataSet.


The **reference values can also be numpy arrays**, for example when extracting forces or charges:

In [6]:
forces = np.array(
    [
        [0.0614444, -0.11830478, 0.03707212],
        [-0.05000567, 0.09744271, -0.03291899],
        [-0.01143873, 0.02086207, -0.00415313],
    ]
)
data_set.add_entry(
    "forces('distorted_H2O')",
    weight=1.0,
    reference=forces,
    metadata={"Source": "Calculated_with_DFT"},
)
print(data_set[-1])

---
Expression: forces('distorted_H2O')
Weight: 1.0
ReferenceValue: |
 array([[ 0.0614444 , -0.11830478, 0.03707212],
 [-0.05000567, 0.09744271, -0.03291899],
 [-0.01143873, 0.02086207, -0.00415313]])
Unit: Ha/bohr, 1.0
Source: Calculated_with_DFT



### DataSetEntry attributes

A DataSetEntry has the following attributes:

* **expression** : str
* **weight** : float or numpy array
* **unit** : 2-tuple (str, float)
* **reference** : float or numpy array
* **sigma** : float
* **jobids** : set of str (read-only). The job ids that appear in the expression.
* **extractors** : set of str (read-only). The extractors that appear in the expression.

In [7]:
print(data_set[-2].expression)
print(data_set[-2].weight)
print(data_set[-2].unit)
print(data_set[-2].reference)
print(data_set[-2].sigma)

energy('H2O')-0.5*energy('O2')-energy('H2')
2.0
('kJ/mol', 2625.15)
-241.8
10.0


In [8]:
print(data_set[-2].jobids)

{'O2', 'H2', 'H2O'}


In [9]:
print(data_set[-2].extractors)

{'energy'}
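
To make the relation between an expression and its `jobids` and `extractors` concrete, here is a small sketch that recovers both with a regular expression. It is not how `scm.params` parses expressions. The helper `parse_expression` and the pattern are hypothetical, and they assume that the job id is always the first, single-quoted argument of each extractor call.

```python
import re

# Hypothetical pattern: an extractor name followed by its first quoted argument.
CALL = re.compile(r"(\w+)\(\s*'([^']+)'")


def parse_expression(expression):
    """Return (extractors, jobids) found in an expression string."""
    matches = CALL.findall(expression)
    extractors = {name for name, _ in matches}
    jobids = {jobid for _, jobid in matches}
    return extractors, jobids


extractors, jobids = parse_expression("energy('H2O')-0.5*energy('O2')-energy('H2')")
print(sorted(extractors))
print(sorted(jobids))
```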


### Accessing the DataSet entries

Above, `data_set[-1]` was used to access the last added element, and `data_set[-2]` to access the second-to-last added element. More generally, the DataSet can be **indexed either as a** `list` **or as a** `dict`:

In [10]:
print(data_set[0].expression)
print(data_set[1].expression)
print(data_set[1].reference)
print(data_set["energy('H2O')-0.5*energy('O2')-energy('H2')"].reference)

angle('H2O', 0, 1, 2)
energy('H2O')-0.5*energy('O2')-energy('H2')
-241.8
-241.8
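
One way to picture a container that accepts both integer and string keys, and that rejects duplicate expressions, is sketched below. `MiniDataSet` is a hypothetical stand-in written for this illustration, not the `scm.params` implementation. It keeps a list for positional access and a dict for lookup by expression, both pointing at the same entry objects.

```python
class MiniDataSet:
    """Hypothetical stand-in, NOT the scm.params DataSet."""

    def __init__(self):
        self._entries = []          # insertion order, for integer indexing
        self._by_expression = {}    # expression -> entry, for string lookup

    def add_entry(self, expression, **fields):
        if expression in self._by_expression:
            raise ValueError(f"Expression `{expression}` already in MiniDataSet.")
        entry = {"expression": expression, **fields}
        self._entries.append(entry)
        self._by_expression[expression] = entry

    def __getitem__(self, key):
        # Strings are treated as expressions, everything else as a list index.
        if isinstance(key, str):
            return self._by_expression[key]
        return self._entries[key]

    def __len__(self):
        return len(self._entries)


ds = MiniDataSet()
ds.add_entry("angle('H2O', 0, 1, 2)", weight=0.333)
ds.add_entry("energy('H2O')", weight=2.0)

# Both lookups return the very same object.
same = ds[1] is ds["energy('H2O')"]
print(same)
```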


**Get the number of entries** in the DataSet with ``len()``:

In [11]:
print(len(data_set))

3


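``data_set.get()`` collects one attribute from every entry; ``data_set.keys()`` returns the same list as ``data_set.get("expression")``:

In [12]: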
print(data_set.get("expression"))
print(data_set.keys())

["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')", "forces('distorted_H2O')"]
["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')", "forces('distorted_H2O')"]


In [13]:
print(data_set.get("weight"))
print(data_set.get("extractors"))

[0.333, 2.0, 1.0]
[{'angle'}, {'energy'}, {'forces'}]


**Loop over DataSet entries**:

In [14]:
for ds_entry in data_set:
    print(ds_entry.expression)

angle('H2O', 0, 1, 2)
energy('H2O')-0.5*energy('O2')-energy('H2')
forces('distorted_H2O')


In [15]:
for expr in data_set.get("expression"):
    print(expr)

angle('H2O', 0, 1, 2)
energy('H2O')-0.5*energy('O2')-energy('H2')
forces('distorted_H2O')


Use the **DataSet.index()** method to get the index of a DataSetEntry:

In [16]:
ds_entry = data_set["energy('H2O')-0.5*energy('O2')-energy('H2')"]
print(data_set.index(ds_entry))

1


In [17]:
print(data_set[1].expression)

energy('H2O')-0.5*energy('O2')-energy('H2')


### Delete a DataSet entry 

**Remove** an entry with ``del``:

In [18]:
data_set.add_entry("energy('some_job')", weight=1.0)
print(len(data_set))
print(data_set[-1].expression)
del data_set[-1] # or del data_set["energy('some_job')"]
print(len(data_set))
print(data_set[-1].expression)

4
energy('some_job')
3
forces('distorted_H2O')


``del`` can also be used to delete multiple entries at once, as in ``del data_set[0,2]`` to remove the first and third entries.
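
A plain Python list does not accept several indices in one `del` statement, and deleting them one by one shifts the positions of the remaining items. The sketch below shows the usual workaround on an ordinary list. It illustrates the index-shifting issue only and says nothing about how DataSet handles it internally.

```python
entries = ["angle", "energy", "forces"]
to_remove = (0, 2)

# Delete from the highest index down so earlier deletions do not shift later targets.
for index in sorted(to_remove, reverse=True):
    del entries[index]

print(entries)
```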

### Compute the intersection of two DataSets

**Intersect** two DataSets with ``&``:

In [19]:
another_data_set = DataSet()
another_data_set.header = {"info": "another data set"}
another_data_set.add_entry("energy('some_job')", weight=1.0)
intersected_data_set = data_set & another_data_set
print(len(intersected_data_set))
print(data_set.header)
print(another_data_set.header)
print(intersected_data_set.header)

0
{'dtype': 'DataSet', 'version': '2023.202'}
{'info': 'another data set', 'dtype': 'DataSet', 'version': '2023.202'}
{'dtype': 'DataSet', 'version': '2023.202'}
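
Judging from the output above, the intersection keeps the entries that occur in both sets and carries the header of the left operand. The sketch below mimics that behavior with plain dicts. The `intersect` helper is hypothetical, and it assumes that entries are matched by their expression alone, which the example above does not confirm.

```python
def intersect(left, right):
    """Keep entries of `left` whose expression also occurs in `right`."""
    shared = [expr for expr in left["entries"] if expr in right["entries"]]
    # The header is copied from the left operand only.
    return {"header": dict(left["header"]), "entries": shared}


first = {
    "header": {"dtype": "DataSet"},
    "entries": ["angle('H2O', 0, 1, 2)", "forces('distorted_H2O')"],
}
second = {"header": {"info": "another data set"}, "entries": ["energy('some_job')"]}
third = {"header": {}, "entries": ["forces('distorted_H2O')", "energy('some_job')"]}

print(intersect(first, second)["entries"])  # nothing in common
print(intersect(first, third)["entries"])   # one shared expression
```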


### Split a DataSet into subsets

**Subset from a list of given expressions**

In [20]:
subset = data_set.from_expressions(["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')"])
print(subset.keys())

["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')"]


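The entries of a subset **are the same objects as in the original DataSet**, so modifying an entry through the subset also modifies it in the original, and vice versa:

In [21]: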
expression = "angle('H2O', 0, 1, 2)"
original_sigma = data_set[expression].sigma
print("For expression {} the original sigma value is: {}".format(expression, original_sigma))

subset[expression].sigma = 1234 # this modifies the entry in the original data_set
print(data_set[expression].sigma)
print(subset[expression].sigma)

# restore the original value, this modifies the subset!
data_set[expression].sigma = original_sigma
print(data_set[expression].sigma)
print(subset[expression].sigma)

For expression angle('H2O', 0, 1, 2) the original sigma value is: 3.0
1234
1234
3.0
3.0


**To modify a subset without modifying the original DataSet**, you must create a `copy`:

In [22]:
new_subset = subset.copy()
new_subset[expression].sigma = 2345
print(new_subset[expression].sigma)
print(subset[expression].sigma)
print(data_set[expression].sigma)

2345
3.0
3.0
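
The same sharing behavior can be reproduced with ordinary Python objects, which may help to see why a copy is needed. In this sketch two lists hold the same dict, while `copy.deepcopy` creates an independent duplicate. Whether `DataSet.copy()` is implemented as a deep copy is an assumption here; the outputs above are merely consistent with the entries being duplicated.

```python
import copy

entry = {"expression": "angle('H2O', 0, 1, 2)", "sigma": 3.0}
original = [entry]
view = [entry]                      # a second list holding the same object
independent = copy.deepcopy(view)   # a new list holding a new object

view[0]["sigma"] = 1234
shared_sigma = original[0]["sigma"]          # changed as well
independent_sigma = independent[0]["sigma"]  # unaffected

print(shared_sigma, independent_sigma)
```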


**Subset from a list of job ids**

In [23]:
subset = data_set.from_jobids(["H2O", "O2", "H2"])
print(subset.keys())

["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')"]


**Subset from metadata key-value pairs**

In [24]:
subset = data_set.from_metadata("Source", "NIST Chemistry WebBook")
print(subset)

---
dtype: DataSet
version: '2023.202'
---
Expression: energy('H2O')-0.5*energy('O2')-energy('H2')
Weight: 2.0
Sigma: 10.0
ReferenceValue: -241.8
Unit: kJ/mol, 2625.15
Description: Hydrogen combustion (gasphase) per mol H2
Source: NIST Chemistry WebBook
...



You can also match using **regular expressions**:

In [25]:
subset = data_set.from_metadata("Source", r"^N[iI]ST\s+Che\w", regex=True)
print(subset.keys())

["energy('H2O')-0.5*energy('O2')-energy('H2')"]
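
To check what such a pattern accepts before using it as a filter, you can try it with Python's `re` module directly. The sketch below uses `re.search`. Which matching function `from_metadata` applies internally is not stated here, but the result above shows that the pattern does not need to cover the whole string.

```python
import re

pattern = r"^N[iI]ST\s+Che\w"  # raw string keeps the backslashes intact

candidates = [
    "NIST Chemistry WebBook",  # matches
    "NiST  Chem",              # matches: lowercase i, two spaces
    "Calculated_with_DFT",     # no match
    "see NIST Chemistry",      # no match: ^ anchors the pattern at the start
]
results = {value: bool(re.search(pattern, value)) for value in candidates}

for value, matched in results.items():
    print(f"{value!r}: {matched}")
```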


**Subset from extractors**

In [26]:
subset = data_set.from_extractors("forces")
print(subset.get("expression"))

["forces('distorted_H2O')"]


A subset from multiple extractors can be generated by passing a list:

In [27]:
subset = data_set.from_extractors(["angle", "forces"])
print(subset.get("expression"))

["angle('H2O', 0, 1, 2)", "forces('distorted_H2O')"]


**Subset from atomic expressions**

In [28]:
subset = data_set.from_atomic_expressions()
print(data_set.get("expression"))
print(subset.get("expression"))

["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')", "forces('distorted_H2O')"]
["angle('H2O', 0, 1, 2)", "forces('distorted_H2O')"]
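
In the output above, the entries that survive are those consisting of a single extractor call, while the energy difference is dropped. A rough way to express that distinction is sketched below. The `is_atomic` helper and its definition (exactly one call, no nested parentheses, nothing around it) are assumptions made for illustration; the criterion used by `from_atomic_expressions` may differ.

```python
import re

# Hypothetical definition: exactly one call, no nested parentheses, nothing around it.
SINGLE_CALL = re.compile(r"\w+\([^()]*\)")


def is_atomic(expression):
    return SINGLE_CALL.fullmatch(expression.strip()) is not None


expressions = [
    "angle('H2O', 0, 1, 2)",
    "energy('H2O')-0.5*energy('O2')-energy('H2')",
    "forces('distorted_H2O')",
]
atomic = [expression for expression in expressions if is_atomic(expression)]
print(atomic)
```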


**Random subset with N entries**

In [29]:
subset = data_set.random(2, seed=314)
print(subset.keys())

["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')"]


**Split the data_set into random nonoverlapping subsets**

In [30]:
subset_list = data_set.split(2 / 3.0, 1 / 3.0, seed=314)
print(subset_list[0].keys())
print(subset_list[1].keys())

["forces('distorted_H2O')", "energy('H2O')-0.5*energy('O2')-energy('H2')"]
["angle('H2O', 0, 1, 2)"]
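
The idea of a seeded, nonoverlapping split can be sketched in a few lines of standard Python: shuffle once, then cut the shuffled list at the cumulative fractions. The `split_fractions` helper is hypothetical. It will not reproduce the assignment shown above, because the random number generator and the rounding rule used by `DataSet.split` are not specified here.

```python
import random


def split_fractions(items, *fractions, seed=None):
    """Shuffle `items`, then cut them into consecutive chunks sized by `fractions`."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    chunks, start, cumulative = [], 0, 0.0
    for fraction in fractions:
        cumulative += fraction
        stop = round(cumulative * len(shuffled))
        chunks.append(shuffled[start:stop])
        start = stop
    return chunks


items = ["angle", "energy", "forces"]
parts = split_fractions(items, 2 / 3, 1 / 3, seed=314)
print([len(part) for part in parts])
```

Because every item appears exactly once in the shuffled list, the chunks cannot overlap, and the same seed always gives the same split.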


**Split the data_set into random nonoverlapping subsets based on the jobids of the entries**

In [31]:
mixed_dataset = DataSet()
for molecule in ["H2O", "NH3", "CH4"]:
    mixed_dataset.add_entry(f"forces('{molecule}')")
    mixed_dataset.add_entry(f"energy('{molecule}')")
    mixed_dataset.add_entry(f"angle('{molecule}', 0, 1, 2)")
    mixed_dataset.add_entry(f"angle('{molecule}', 0, 2, 1)")
subset = list(mixed_dataset.split_by_jobids(2 / 3.0, 1 / 3.0, seed=314))
print(subset[0].keys())
print(subset[1].keys())

["forces('H2O')", "energy('H2O')", "angle('H2O', 0, 1, 2)", "angle('H2O', 0, 2, 1)", "forces('NH3')", "energy('NH3')", "angle('NH3', 0, 1, 2)", "angle('NH3', 0, 2, 1)"]
["forces('CH4')", "energy('CH4')", "angle('CH4', 0, 1, 2)", "angle('CH4', 0, 2, 1)"]
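
What distinguishes a split by job ids is that entries are grouped per job first, so all entries of one job end up in the same subset. The sketch below shows only that grouping step. The `group_by_jobid` helper is hypothetical and assumes each expression refers to a single job, given as the first quoted argument. The job ids are divided by hand here rather than at random.

```python
import re
from collections import defaultdict


def group_by_jobid(expressions):
    """Map each job id to the list of expressions that reference it."""
    groups = defaultdict(list)
    for expression in expressions:
        jobid = re.search(r"\('([^']+)'", expression).group(1)  # first quoted argument
        groups[jobid].append(expression)
    return dict(groups)


expressions = []
for molecule in ["H2O", "NH3", "CH4"]:
    expressions += [
        f"forces('{molecule}')",
        f"energy('{molecule}')",
        f"angle('{molecule}', 0, 1, 2)",
        f"angle('{molecule}', 0, 2, 1)",
    ]

groups = group_by_jobid(expressions)

# Dividing the job ids (here by hand) moves whole groups at once.
first = [e for jobid in ("H2O", "NH3") for e in groups[jobid]]
second = [e for jobid in ("CH4",) for e in groups[jobid]]
print(len(first), len(second))
```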


### DataSet header

The header can be used to store comments about a data_set. When the DataSet is stored as a .yaml file, the header is written as a separate YAML document at the top of the file.

In [32]:
data_set.header = {"Comment": "An example data_set", "Date": "21-May-2001"}
print(data_set)

---
Comment: An example data_set
Date: 21-May-2001
dtype: DataSet
version: '2023.202'
---
Expression: angle('H2O', 0, 1, 2)
Weight: 0.333
Sigma: 3.0
Unit: degree, 1.0
---
Expression: energy('H2O')-0.5*energy('O2')-energy('H2')
Weight: 2.0
Sigma: 10.0
ReferenceValue: -241.8
Unit: kJ/mol, 2625.15
Description: Hydrogen combustion (gasphase) per mol H2
Source: NIST Chemistry WebBook
---
Expression: forces('distorted_H2O')
Weight: 1.0
ReferenceValue: |
 array([[ 0.0614444 , -0.11830478, 0.03707212],
 [-0.05000567, 0.09744271, -0.03291899],
 [-0.01143873, 0.02086207, -0.00415313]])
Unit: Ha/bohr, 1.0
Source: Calculated_with_DFT
...
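
In the printed form above, each `---` line starts a new YAML document and `...` closes the stream, so the header is simply the first document. The sketch below separates header and entries from an abbreviated copy of that text using string operations only. It is meant to show the layout; a real file should be read with a YAML parser or with the library itself.

```python
# Abbreviated copy of the printed data_set, for illustration only.
text = """---
Comment: An example data_set
dtype: DataSet
---
Expression: angle('H2O', 0, 1, 2)
Weight: 0.333
---
Expression: forces('distorted_H2O')
Weight: 1.0
...
"""

documents, current = [], None
for line in text.splitlines():
    if line == "---":        # start of a new YAML document
        current = []
        documents.append(current)
    elif line == "...":      # end marker
        break
    elif current is not None:
        current.append(line)

header, entries = documents[0], documents[1:]
print(header)
print(len(entries))
```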



### Save the data set

In [33]:
data_set.store("data_set.yaml")