{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Add an entry" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from scm.params import *\n", "import numpy as np\n", "\n", "data_set = DataSet()\n", "data_set.add_entry(\"angle('H2O', 0, 1, 2)\", weight=0.333)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To **access the last added element**, use ``data_set[-1]``:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "String representation of data_set[-1]\n", "---\n", "Expression: angle('H2O', 0, 1, 2)\n", "Weight: 0.333\n", "Unit: degree, 1.0\n", "\n", "Type: \n" ] } ], "source": [ "print(\"String representation of data_set[-1]\")\n", "print(data_set[-1])\n", "print(\"Type: {}\".format(type(data_set[-1])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also **change an entry after adding it**:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---\n", "Expression: angle('H2O', 0, 1, 2)\n", "Weight: 0.333\n", "Sigma: 3.0\n", "Unit: degree, 1.0\n", "\n" ] } ], "source": [ "data_set[-1].sigma = 3.0\n", "print(data_set[-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We recommend always specifying the *reference value*, the *unit*, and the *sigma* value when adding an entry, along with any meaningful *metadata* about the data set entry:"
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---\n", "Expression: energy('H2O')-0.5*energy('O2')-energy('H2')\n", "Weight: 2.0\n", "Sigma: 10.0\n", "ReferenceValue: -241.8\n", "Unit: kJ/mol, 2625.15\n", "Description: Hydrogen combustion (gasphase) per mol H2\n", "Source: NIST Chemistry WebBook\n", "\n" ] } ], "source": [ "data_set.add_entry(\n", " \"energy('H2O')-0.5*energy('O2')-energy('H2')\",\n", " weight=2.0,\n", " reference=-241.8,\n", " unit=(\"kJ/mol\", 2625.15),\n", " sigma=10.0,\n", " metadata={\n", " \"Source\": \"NIST Chemistry WebBook\",\n", " \"Description\": \"Hydrogen combustion (gasphase) per mol H2\",\n", " },\n", ")\n", "print(data_set[-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All *expressions* in a single DataSet **must be unique**:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Caught the following exception: Expression `energy('H2O')-0.5*energy('O2')-energy('H2')` already in DataSet.\n" ] } ], "source": [ "try:\n", " data_set.add_entry(\"energy('H2O')-0.5*energy('O2')-energy('H2')\", weight=2.0)\n", "except Exception as e:\n", " print(\"Caught the following exception: {}\".format(e))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **reference values can also be numpy arrays**, for example when extracting forces or charges:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---\n", "Expression: forces('distorted_H2O')\n", "Weight: 1.0\n", "ReferenceValue: |\n", " array([[ 0.0614444 , -0.11830478, 0.03707212],\n", " [-0.05000567, 0.09744271, -0.03291899],\n", " [-0.01143873, 0.02086207, -0.00415313]])\n", "Unit: Ha/bohr, 1.0\n", "Source: Calculated_with_DFT\n", "\n" ] } ], "source": [ "forces = np.array(\n", " [\n", " [0.0614444, -0.11830478, 
0.03707212],\n", " [-0.05000567, 0.09744271, -0.03291899],\n", " [-0.01143873, 0.02086207, -0.00415313],\n", " ]\n", ")\n", "data_set.add_entry(\n", " \"forces('distorted_H2O')\",\n", " weight=1.0,\n", " reference=forces,\n", " metadata={\"Source\": \"Calculated_with_DFT\"},\n", ")\n", "print(data_set[-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataSetEntry attributes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A DataSetEntry has the following attributes:\n", "\n", "* **expression** : str\n", "* **weight** : float or numpy array\n", "* **unit** : 2-tuple (str, float)\n", "* **reference** : float or numpy array\n", "* **sigma** : float\n", "* **jobids** : set of str (read-only). The job ids that appear in the expression.\n", "* **extractors** : set of str (read-only). The extractors that appear in the expression." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "energy('H2O')-0.5*energy('O2')-energy('H2')\n", "2.0\n", "('kJ/mol', 2625.15)\n", "-241.8\n", "10.0\n" ] } ], "source": [ "print(data_set[-2].expression)\n", "print(data_set[-2].weight)\n", "print(data_set[-2].unit)\n", "print(data_set[-2].reference)\n", "print(data_set[-2].sigma)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'O2', 'H2', 'H2O'}\n" ] } ], "source": [ "print(data_set[-2].jobids)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'energy'}\n" ] } ], "source": [ "print(data_set[-2].extractors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing the DataSet entries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, `data_set[-1]` was used to access the last added element, and `data_set[-2]` to access the second to last added element. 
More generally, the DataSet can be **indexed either like a** `list` **(by position) or like a** `dict` **(by expression)**:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "angle('H2O', 0, 1, 2)\n", "energy('H2O')-0.5*energy('O2')-energy('H2')\n", "-241.8\n", "-241.8\n" ] } ], "source": [ "print(data_set[0].expression)\n", "print(data_set[1].expression)\n", "print(data_set[1].reference)\n", "print(data_set[\"energy('H2O')-0.5*energy('O2')-energy('H2')\"].reference)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Get the number of entries** in the DataSet with ``len()``:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3\n" ] } ], "source": [ "print(len(data_set))" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext", "tags": [] }, "source": [ "**Get all of the expressions** with :meth:`get('expression')` or :meth:`keys`:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\", \"forces('distorted_H2O')\"]\n", "[\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\", \"forces('distorted_H2O')\"]\n" ] } ], "source": [ "print(data_set.get(\"expression\"))\n", "print(data_set.keys())" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext", "tags": [] }, "source": [ "The :meth:`get` method also works for all other `DataSetEntry` attributes, *e.g.*:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.333, 2.0, 1.0]\n", "[{'angle'}, {'energy'}, {'forces'}]\n" ] } ], "source": [ "print(data_set.get(\"weight\"))\n", "print(data_set.get(\"extractors\"))" ] }, { "cell_type": "markdown", "metadata": {},
"source": [ "**Loop over DataSet entries**:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "angle('H2O', 0, 1, 2)\n", "energy('H2O')-0.5*energy('O2')-energy('H2')\n", "forces('distorted_H2O')\n" ] } ], "source": [ "for ds_entry in data_set:\n", " print(ds_entry.expression)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext", "tags": [] }, "source": [ "or using the :meth:`get ` method:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "angle('H2O', 0, 1, 2)\n", "energy('H2O')-0.5*energy('O2')-energy('H2')\n", "forces('distorted_H2O')\n" ] } ], "source": [ "for expr in data_set.get(\"expression\"):\n", " print(expr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the **DataSet.index()** method to get the index of a DataSetEntry:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n" ] } ], "source": [ "ds_entry = data_set[\"energy('H2O')-0.5*energy('O2')-energy('H2')\"]\n", "print(data_set.index(ds_entry))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "energy('H2O')-0.5*energy('O2')-energy('H2')\n" ] } ], "source": [ "print(data_set[1].expression)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Delete a DataSet entry \n", "\n", "**Remove** an entry with ``del``:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n", "energy('some_job')\n", "3\n", "forces('distorted_H2O')\n" ] } ], "source": [ "data_set.add_entry(\"energy('some_job')\", weight=1.0)\n", "print(len(data_set))\n", "print(data_set[-1].expression)\n", "del data_set[-1] # or del 
data_set[\"energy('some_job')\"]\n", "print(len(data_set))\n", "print(data_set[-1].expression)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``del`` can also delete multiple entries at once: ``del data_set[0, 2]`` removes the first and third entries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compute the intersection of two DataSets\n", "\n", "**Intersect** two DataSets with ``&``:" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext", "tags": [] }, "source": [ ".. important::\n", "\n", " This creates a copy of the left-hand side dataset, including e.g. its header information, that contains only the entries also present in the right-hand side dataset." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "{'dtype': 'DataSet', 'version': '2023.202'}\n", "{'info': 'another data set', 'dtype': 'DataSet', 'version': '2023.202'}\n", "{'dtype': 'DataSet', 'version': '2023.202'}\n" ] } ], "source": [ "another_data_set = DataSet()\n", "another_data_set.header = {\"info\": \"another data set\"}\n", "another_data_set.add_entry(\"energy('some_job')\", weight=1.0)\n", "intersected_data_set = data_set & another_data_set\n", "print(len(intersected_data_set))\n", "print(data_set.header)\n", "print(another_data_set.header)\n", "print(intersected_data_set.header)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split a DataSet into subsets" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext", "tags": [] }, "source": [ "The following methods return a new :class:`DataSet`, or several of them for the two ``split`` methods:\n", "\n", "* :meth:`split` to get a list of nonoverlapping subsets.\n", "* :meth:`split_by_jobids` to get a list of nonoverlapping subsets where e.g.
forces and energies from the same job end up in the same subset.\n", "* :meth:`maxjobs`\n", "* :meth:`random`\n", "* :meth:`from_expressions`\n", "* :meth:`from_jobids`\n", "* :meth:`from_extractors`\n", "* :meth:`from_metadata`\n", "* :meth:`from_atomic_expressions` to obtain the subset of all entries that have an \"atomic\" expression." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext", "tags": [] }, "source": [ ".. important::\n", "\n", " For all of the above methods,\n", " modifying entries in a subset will also modify the entries in the original data_set, and vice versa!\n", " If you do not want this behavior, apply the\n", " :meth:`copy` method to the created subsets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Subset from a list of given expressions**" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\"]\n" ] } ], "source": [ "subset = data_set.from_expressions([\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\"])\n", "print(subset.keys())" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext", "tags": [] }, "source": [ ".. important::\n", "\n", " Modifying entries in a subset will also modify the entries in the original data_set, and vice versa!"
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "For expression angle('H2O', 0, 1, 2) the original sigma value is: 3.0\n", "1234\n", "1234\n", "3.0\n", "3.0\n" ] } ], "source": [ "expression = \"angle('H2O', 0, 1, 2)\"\n", "original_sigma = data_set[expression].sigma\n", "print(\"For expression {} the original sigma value is: {}\".format(expression, original_sigma))\n", "\n", "subset[expression].sigma = 1234 # this modifies the entry in the original data_set\n", "print(data_set[expression].sigma)\n", "print(subset[expression].sigma)\n", "\n", "# restore the original value, this modifies the subset!\n", "data_set[expression].sigma = original_sigma\n", "print(data_set[expression].sigma)\n", "print(subset[expression].sigma)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**To modify a subset without modifying the original DataSet**, you must create a `copy`:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2345\n", "3.0\n", "3.0\n" ] } ], "source": [ "new_subset = subset.copy()\n", "new_subset[expression].sigma = 2345\n", "print(new_subset[expression].sigma)\n", "print(subset[expression].sigma)\n", "print(data_set[expression].sigma)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Subset from a list of job ids**" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\"]\n" ] } ], "source": [ "subset = data_set.from_jobids([\"H2O\", \"O2\", \"H2\"])\n", "print(subset.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Subset from metadata key-value pairs**" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"---\n", "dtype: DataSet\n", "version: '2023.202'\n", "---\n", "Expression: energy('H2O')-0.5*energy('O2')-energy('H2')\n", "Weight: 2.0\n", "Sigma: 10.0\n", "ReferenceValue: -241.8\n", "Unit: kJ/mol, 2625.15\n", "Description: Hydrogen combustion (gasphase) per mol H2\n", "Source: NIST Chemistry WebBook\n", "...\n", "\n" ] } ], "source": [ "subset = data_set.from_metadata(\"Source\", \"NIST Chemistry WebBook\")\n", "print(subset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also match using **regular expressions**:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"energy('H2O')-0.5*energy('O2')-energy('H2')\"]\n" ] } ], "source": [ "subset = data_set.from_metadata(\"Source\", \"^N[iI]ST\\s+Che\\w\", regex=True)\n", "print(subset.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Subset from extractors**" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"forces('distorted_H2O')\"]\n" ] } ], "source": [ "subset = data_set.from_extractors(\"forces\")\n", "print(subset.get(\"expression\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A subset from multiple extractors can be generated by passing a list:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"angle('H2O', 0, 1, 2)\", \"forces('distorted_H2O')\"]\n" ] } ], "source": [ "subset = data_set.from_extractors([\"angle\", \"forces\"])\n", "print(subset.get(\"expression\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Subset from atomic expressions**" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\", 
\"forces('distorted_H2O')\"]\n", "[\"angle('H2O', 0, 1, 2)\", \"forces('distorted_H2O')\"]\n" ] } ], "source": [ "subset = data_set.from_atomic_expressions()\n", "print(data_set.get(\"expression\"))\n", "print(subset.get(\"expression\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Random subset with N entries**" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\"]\n" ] } ], "source": [ "subset = data_set.random(2, seed=314)\n", "print(subset.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Split the data_set into random nonoverlapping subsets**" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"forces('distorted_H2O')\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\"]\n", "[\"angle('H2O', 0, 1, 2)\"]\n" ] } ], "source": [ "subset_list = data_set.split(2 / 3.0, 1 / 3.0, seed=314)\n", "print(subset_list[0].keys())\n", "print(subset_list[1].keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Split the data_set into random nonoverlapping subsets based on the jobids of the entries**" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"forces('H2O')\", \"energy('H2O')\", \"angle('H2O', 0, 1, 2)\", \"angle('H2O', 0, 2, 1)\", \"forces('NH3')\", \"energy('NH3')\", \"angle('NH3', 0, 1, 2)\", \"angle('NH3', 0, 2, 1)\"]\n", "[\"forces('CH4')\", \"energy('CH4')\", \"angle('CH4', 0, 1, 2)\", \"angle('CH4', 0, 2, 1)\"]\n" ] } ], "source": [ "mixed_dataset = DataSet()\n", "for molecule in [\"H2O\", \"NH3\", \"CH4\"]:\n", " mixed_dataset.add_entry(f\"forces('{molecule}')\")\n", " mixed_dataset.add_entry(f\"energy('{molecule}')\")\n", " mixed_dataset.add_entry(f\"angle('{molecule}', 0, 1, 
2)\")\n", " mixed_dataset.add_entry(f\"angle('{molecule}', 0, 2, 1)\")\n", "subset = list(mixed_dataset.split_by_jobids(2 / 3.0, 1 / 3.0, seed=314))\n", "print(subset[0].keys())\n", "print(subset[1].keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataSet header\n", "The header can be used to store comments about a data_set. When storing as a .yaml file, the header is printed as a separate YAML entry at the top of the file." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---\n", "Comment: An example data_set\n", "Date: 21-May-2001\n", "dtype: DataSet\n", "version: '2023.202'\n", "---\n", "Expression: angle('H2O', 0, 1, 2)\n", "Weight: 0.333\n", "Sigma: 3.0\n", "Unit: degree, 1.0\n", "---\n", "Expression: energy('H2O')-0.5*energy('O2')-energy('H2')\n", "Weight: 2.0\n", "Sigma: 10.0\n", "ReferenceValue: -241.8\n", "Unit: kJ/mol, 2625.15\n", "Description: Hydrogen combustion (gasphase) per mol H2\n", "Source: NIST Chemistry WebBook\n", "---\n", "Expression: forces('distorted_H2O')\n", "Weight: 1.0\n", "ReferenceValue: |\n", " array([[ 0.0614444 , -0.11830478, 0.03707212],\n", " [-0.05000567, 0.09744271, -0.03291899],\n", " [-0.01143873, 0.02086207, -0.00415313]])\n", "Unit: Ha/bohr, 1.0\n", "Source: Calculated_with_DFT\n", "...\n", "\n" ] } ], "source": [ "data_set.header = {\"Comment\": \"An example data_set\", \"Date\": \"21-May-2001\"}\n", "print(data_set)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save the data set" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext", "tags": [] }, "source": [ "See also :ref:`DataSetLoadOrStore`" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "data_set.store(\"data_set.yaml\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { 
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "toc-showmarkdowntxt": false, "toc-showtags": false }, "nbformat": 4, "nbformat_minor": 4 }