{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add an entry"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scm.params import *\n",
    "import numpy as np\n",
    "\n",
    "data_set = DataSet()\n",
    "data_set.add_entry(\"angle('H2O', 0, 1, 2)\", weight=0.333)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To **access the last added element**, use ``data_set[-1]``"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "String representation of data_set[-1]\n",
      "---\n",
      "Expression: angle('H2O', 0, 1, 2)\n",
      "Weight: 0.333\n",
      "Unit: degree, 1.0\n",
      "\n",
      "Type: <class 'scm.params.core.dataset.DataSetEntry'>\n"
     ]
    }
   ],
   "source": [
    "print(\"String representation of data_set[-1]\")\n",
    "print(data_set[-1])\n",
    "print(\"Type: {}\".format(type(data_set[-1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also **change it after you've added it**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---\n",
      "Expression: angle('H2O', 0, 1, 2)\n",
      "Weight: 0.333\n",
      "Sigma: 3.0\n",
      "Unit: degree, 1.0\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data_set[-1].sigma = 3.0\n",
    "print(data_set[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We recommend to always specify the *reference value*, the *unit*, and the *sigma* value when adding an entry, and also to specify any meaningful *metadata* about the data set entry."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---\n",
      "Expression: energy('H2O')-0.5*energy('O2')-energy('H2')\n",
      "Weight: 2.0\n",
      "Sigma: 10.0\n",
      "ReferenceValue: -241.8\n",
      "Unit: kJ/mol, 2625.15\n",
      "Description: Hydrogen combustion (gasphase) per mol H2\n",
      "Source: NIST Chemistry WebBook\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data_set.add_entry(\n",
    "    \"energy('H2O')-0.5*energy('O2')-energy('H2')\",\n",
    "    weight=2.0,\n",
    "    reference=-241.8,\n",
    "    unit=(\"kJ/mol\", 2625.15),\n",
    "    sigma=10.0,\n",
    "    metadata={\n",
    "        \"Source\": \"NIST Chemistry WebBook\",\n",
    "        \"Description\": \"Hydrogen combustion (gasphase) per mol H2\",\n",
    "    },\n",
    ")\n",
    "print(data_set[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All *expressions* in a single DataSet **must be unique**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Caught the following exception: Expression `energy('H2O')-0.5*energy('O2')-energy('H2')` already in DataSet.\n"
     ]
    }
   ],
   "source": [
    "try:\n",
    "    data_set.add_entry(\"energy('H2O')-0.5*energy('O2')-energy('H2')\", weight=2.0)\n",
    "except Exception as e:\n",
    "    print(\"Caught the following exception: {}\".format(e))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **reference values can also be numpy arrays**, for example when extracting forces or charges:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---\n",
      "Expression: forces('distorted_H2O')\n",
      "Weight: 1.0\n",
      "ReferenceValue: |\n",
      "   array([[ 0.0614444 , -0.11830478,  0.03707212],\n",
      "          [-0.05000567,  0.09744271, -0.03291899],\n",
      "          [-0.01143873,  0.02086207, -0.00415313]])\n",
      "Unit: Ha/bohr, 1.0\n",
      "Source: Calculated_with_DFT\n",
      "\n"
     ]
    }
   ],
   "source": [
    "forces = np.array(\n",
    "    [\n",
    "        [0.0614444, -0.11830478, 0.03707212],\n",
    "        [-0.05000567, 0.09744271, -0.03291899],\n",
    "        [-0.01143873, 0.02086207, -0.00415313],\n",
    "    ]\n",
    ")\n",
    "data_set.add_entry(\n",
    "    \"forces('distorted_H2O')\",\n",
    "    weight=1.0,\n",
    "    reference=forces,\n",
    "    metadata={\"Source\": \"Calculated_with_DFT\"},\n",
    ")\n",
    "print(data_set[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DataSetEntry attributes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A DataSetEntry has the following attributes:\n",
    "\n",
    "* **expression** : str\n",
    "* **weight** : float or numpy array\n",
    "* **unit** : 2-tuple (str, float)\n",
    "* **reference** : float or numpy array\n",
    "* **sigma** : float\n",
    "* **jobids** : set of str (read-only). The job ids that appear in the expression.\n",
    "* **extractors** : set of str (read-only). The extractors that appear in the expression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "energy('H2O')-0.5*energy('O2')-energy('H2')\n",
      "2.0\n",
      "('kJ/mol', 2625.15)\n",
      "-241.8\n",
      "10.0\n"
     ]
    }
   ],
   "source": [
    "print(data_set[-2].expression)\n",
    "print(data_set[-2].weight)\n",
    "print(data_set[-2].unit)\n",
    "print(data_set[-2].reference)\n",
    "print(data_set[-2].sigma)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'O2', 'H2', 'H2O'}\n"
     ]
    }
   ],
   "source": [
    "print(data_set[-2].jobids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'energy'}\n"
     ]
    }
   ],
   "source": [
    "print(data_set[-2].extractors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accessing the DataSet entries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, `data_set[-1]` was used to access the last added element, and `data_set[-2]` to access the second to last added element. More generally, the DataSet can be **indexed either as a** `list` **or as a** `dict`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "angle('H2O', 0, 1, 2)\n",
      "energy('H2O')-0.5*energy('O2')-energy('H2')\n",
      "-241.8\n",
      "-241.8\n"
     ]
    }
   ],
   "source": [
    "print(data_set[0].expression)\n",
    "print(data_set[1].expression)\n",
    "print(data_set[1].reference)\n",
    "print(data_set[\"energy('H2O')-0.5*energy('O2')-energy('H2')\"].reference)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Get the number of entries** in the DataSet with ``len()``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3\n"
     ]
    }
   ],
   "source": [
    "print(len(data_set))"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    "**Get all of the expressions** with :meth:`get('expression') <scm.params.core.data_set.DataSet.get>` or :meth:`keys <scm.params.core.data_set.DataSet.keys>`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\", \"forces('distorted_H2O')\"]\n",
      "[\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\", \"forces('distorted_H2O')\"]\n"
     ]
    }
   ],
   "source": [
    "print(data_set.get(\"expression\"))\n",
    "print(data_set.keys())"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    "The :meth:`get <scm.params.core.data_set.DataSet.get>` method also works for all other `DataSetEntry` attirbutes, *e.g.*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.333, 2.0, 1.0]\n",
      "[{'angle'}, {'energy'}, {'forces'}]\n"
     ]
    }
   ],
   "source": [
    "print(data_set.get(\"weight\"))\n",
    "print(data_set.get(\"extractors\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Loop over DataSet entries**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "angle('H2O', 0, 1, 2)\n",
      "energy('H2O')-0.5*energy('O2')-energy('H2')\n",
      "forces('distorted_H2O')\n"
     ]
    }
   ],
   "source": [
    "for ds_entry in data_set:\n",
    "    print(ds_entry.expression)"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    "or using the :meth:`get <scm.params.core.data_set.DataSet.get>` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "angle('H2O', 0, 1, 2)\n",
      "energy('H2O')-0.5*energy('O2')-energy('H2')\n",
      "forces('distorted_H2O')\n"
     ]
    }
   ],
   "source": [
    "for expr in data_set.get(\"expression\"):\n",
    "    print(expr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the **DataSet.index()** method to get the index of a DataSetEntry:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n"
     ]
    }
   ],
   "source": [
    "ds_entry = data_set[\"energy('H2O')-0.5*energy('O2')-energy('H2')\"]\n",
    "print(data_set.index(ds_entry))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "energy('H2O')-0.5*energy('O2')-energy('H2')\n"
     ]
    }
   ],
   "source": [
    "print(data_set[1].expression)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Delete a DataSet entry \n",
    "\n",
    "**Remove** an entry with ``del``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4\n",
      "energy('some_job')\n",
      "3\n",
      "forces('distorted_H2O')\n"
     ]
    }
   ],
   "source": [
    "data_set.add_entry(\"energy('some_job')\", weight=1.0)\n",
    "print(len(data_set))\n",
    "print(data_set[-1].expression)\n",
    "del data_set[-1]  # or del data_set[\"energy('some_job')\"]\n",
    "print(len(data_set))\n",
    "print(data_set[-1].expression)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "``del`` can also be used to delete multiple entries at once, as in ``del data_set[0,2]`` to remove the first and third entries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute the intersection of two DataSets\n",
    "\n",
    "**Intersect** two DataSets with ``&``:"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    ".. important::\n",
    "\n",
    "    This creates a copy of the left-hand side dataset (e.g. header information) that only has entries also present in the right-hand side dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n",
      "{'dtype': 'DataSet', 'version': '2023.202'}\n",
      "{'info': 'another data set', 'dtype': 'DataSet', 'version': '2023.202'}\n",
      "{'dtype': 'DataSet', 'version': '2023.202'}\n"
     ]
    }
   ],
   "source": [
    "another_data_set = DataSet()\n",
    "another_data_set.header = {\"info\": \"another data set\"}\n",
    "another_data_set.add_entry(\"energy('some_job')\", weight=1.0)\n",
    "intersected_data_set = data_set & another_data_set\n",
    "print(len(intersected_data_set))\n",
    "print(data_set.header)\n",
    "print(another_data_set.header)\n",
    "print(intersected_data_set.header)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Split a DataSet into subsets"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    "The following methods return a new :class:`DataSet`:\n",
    "\n",
    "* :meth:`split <scm.params.core.data_set.DataSet.split>` to get a list of nonoverlapping subsets.\n",
    "* :meth:`split_by_jobids <scm.params.core.data_set.DataSet.split_by_jobids>` to get a list of nonoverlapping subsets where e.g. forces and energies from the same job end up in the same subset.\n",
    "* :meth:`maxjobs <scm.params.core.data_set.DataSet.maxjobs>`\n",
    "* :meth:`random <scm.params.core.data_set.DataSet.random>`\n",
    "* :meth:`from_expressions <scm.params.core.data_set.DataSet.from_expressions>`\n",
    "* :meth:`from_jobids <scm.params.core.data_set.DataSet.from_jobids>`\n",
    "* :meth:`from_extractors <scm.params.core.data_set.DataSet.from_extractors>`\n",
    "* :meth:`from_metadata <scm.params.core.data_set.DataSet.from_metadata>`\n",
    "* :meth:`from_atomic_expressions <scm.params.core.data_set.DataSet.from_atomic_expressions>` to obtain the subset of all entries that have an \"atomic\" expression."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    ".. important::\n",
    "\n",
    "    For all of the above methods,\n",
    "    modifying entries in a subset will also modify the entries in the original data_set, and vice versa!\n",
    "    If you do not want this behavior, apply the \n",
    "    :meth:`copy <scm.params.core.data_set.DataSet.copy>` method to the created subsets.\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Subset from a list of given expressions**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\"]\n"
     ]
    }
   ],
   "source": [
    "subset = data_set.from_expressions([\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\"])\n",
    "print(subset.keys())"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    ".. important::\n",
    "\n",
    "    Modifying entries in a subset will also modify the entries in the original data_set, and vice versa!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "For expression angle('H2O', 0, 1, 2) the original sigma value is: 3.0\n",
      "1234\n",
      "1234\n",
      "3.0\n",
      "3.0\n"
     ]
    }
   ],
   "source": [
    "expression = \"angle('H2O', 0, 1, 2)\"\n",
    "original_sigma = data_set[expression].sigma\n",
    "print(\"For expression {} the original sigma value is: {}\".format(expression, original_sigma))\n",
    "\n",
    "subset[expression].sigma = 1234  # this modifies the entry in the original data_set\n",
    "print(data_set[expression].sigma)\n",
    "print(subset[expression].sigma)\n",
    "\n",
    "# restore the original value, this modifies the subset!\n",
    "data_set[expression].sigma = original_sigma\n",
    "print(data_set[expression].sigma)\n",
    "print(subset[expression].sigma)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**To modify a subset without modifying the original DataSet**, you must create a `copy`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2345\n",
      "3.0\n",
      "3.0\n"
     ]
    }
   ],
   "source": [
    "new_subset = subset.copy()\n",
    "new_subset[expression].sigma = 2345\n",
    "print(new_subset[expression].sigma)\n",
    "print(subset[expression].sigma)\n",
    "print(data_set[expression].sigma)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Subset from a list of job ids**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\"]\n"
     ]
    }
   ],
   "source": [
    "subset = data_set.from_jobids([\"H2O\", \"O2\", \"H2\"])\n",
    "print(subset.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Subset from metadata key-value pairs**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---\n",
      "dtype: DataSet\n",
      "version: '2023.202'\n",
      "---\n",
      "Expression: energy('H2O')-0.5*energy('O2')-energy('H2')\n",
      "Weight: 2.0\n",
      "Sigma: 10.0\n",
      "ReferenceValue: -241.8\n",
      "Unit: kJ/mol, 2625.15\n",
      "Description: Hydrogen combustion (gasphase) per mol H2\n",
      "Source: NIST Chemistry WebBook\n",
      "...\n",
      "\n"
     ]
    }
   ],
   "source": [
    "subset = data_set.from_metadata(\"Source\", \"NIST Chemistry WebBook\")\n",
    "print(subset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also match using **regular expressions**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"energy('H2O')-0.5*energy('O2')-energy('H2')\"]\n"
     ]
    }
   ],
   "source": [
    "subset = data_set.from_metadata(\"Source\", \"^N[iI]ST\\s+Che\\w\", regex=True)\n",
    "print(subset.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Subset from extractors**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"forces('distorted_H2O')\"]\n"
     ]
    }
   ],
   "source": [
    "subset = data_set.from_extractors(\"forces\")\n",
    "print(subset.get(\"expression\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A subset from multiple extractors can be generated by passing a list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"angle('H2O', 0, 1, 2)\", \"forces('distorted_H2O')\"]\n"
     ]
    }
   ],
   "source": [
    "subset = data_set.from_extractors([\"angle\", \"forces\"])\n",
    "print(subset.get(\"expression\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Subset from atomic expressions**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\", \"forces('distorted_H2O')\"]\n",
      "[\"angle('H2O', 0, 1, 2)\", \"forces('distorted_H2O')\"]\n"
     ]
    }
   ],
   "source": [
    "subset = data_set.from_atomic_expressions()\n",
    "print(data_set.get(\"expression\"))\n",
    "print(subset.get(\"expression\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Random subset with N entries**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"angle('H2O', 0, 1, 2)\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\"]\n"
     ]
    }
   ],
   "source": [
    "subset = data_set.random(2, seed=314)\n",
    "print(subset.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Split the data_set into random nonoverlapping subsets**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"forces('distorted_H2O')\", \"energy('H2O')-0.5*energy('O2')-energy('H2')\"]\n",
      "[\"angle('H2O', 0, 1, 2)\"]\n"
     ]
    }
   ],
   "source": [
    "subset_list = data_set.split(2 / 3.0, 1 / 3.0, seed=314)\n",
    "print(subset_list[0].keys())\n",
    "print(subset_list[1].keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Split the data_set into random nonoverlapping subsets based on the jobids of the entries**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\"forces('H2O')\", \"energy('H2O')\", \"angle('H2O', 0, 1, 2)\", \"angle('H2O', 0, 2, 1)\", \"forces('NH3')\", \"energy('NH3')\", \"angle('NH3', 0, 1, 2)\", \"angle('NH3', 0, 2, 1)\"]\n",
      "[\"forces('CH4')\", \"energy('CH4')\", \"angle('CH4', 0, 1, 2)\", \"angle('CH4', 0, 2, 1)\"]\n"
     ]
    }
   ],
   "source": [
    "mixed_dataset = DataSet()\n",
    "for molecule in [\"H2O\", \"NH3\", \"CH4\"]:\n",
    "    mixed_dataset.add_entry(f\"forces('{molecule}')\")\n",
    "    mixed_dataset.add_entry(f\"energy('{molecule}')\")\n",
    "    mixed_dataset.add_entry(f\"angle('{molecule}', 0, 1, 2)\")\n",
    "    mixed_dataset.add_entry(f\"angle('{molecule}', 0, 2, 1)\")\n",
    "subset = list(mixed_dataset.split_by_jobids(2 / 3.0, 1 / 3.0, seed=314))\n",
    "print(subset[0].keys())\n",
    "print(subset[1].keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DataSet header\n",
    "The header can be used to store comments about a data_set. When storing as a .yaml file, the header is printed as a separate YAML entry at the top of the file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---\n",
      "Comment: An example data_set\n",
      "Date: 21-May-2001\n",
      "dtype: DataSet\n",
      "version: '2023.202'\n",
      "---\n",
      "Expression: angle('H2O', 0, 1, 2)\n",
      "Weight: 0.333\n",
      "Sigma: 3.0\n",
      "Unit: degree, 1.0\n",
      "---\n",
      "Expression: energy('H2O')-0.5*energy('O2')-energy('H2')\n",
      "Weight: 2.0\n",
      "Sigma: 10.0\n",
      "ReferenceValue: -241.8\n",
      "Unit: kJ/mol, 2625.15\n",
      "Description: Hydrogen combustion (gasphase) per mol H2\n",
      "Source: NIST Chemistry WebBook\n",
      "---\n",
      "Expression: forces('distorted_H2O')\n",
      "Weight: 1.0\n",
      "ReferenceValue: |\n",
      "   array([[ 0.0614444 , -0.11830478,  0.03707212],\n",
      "          [-0.05000567,  0.09744271, -0.03291899],\n",
      "          [-0.01143873,  0.02086207, -0.00415313]])\n",
      "Unit: Ha/bohr, 1.0\n",
      "Source: Calculated_with_DFT\n",
      "...\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data_set.header = {\"Comment\": \"An example data_set\", \"Date\": \"21-May-2001\"}\n",
    "print(data_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save the data set"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "tags": []
   },
   "source": [
    "See also :ref:`DataSetLoadOrStore`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_set.store(\"data_set.yaml\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  },
  "toc-showmarkdowntxt": false,
  "toc-showtags": false
 },
 "nbformat": 4,
 "nbformat_minor": 4
}