Benchmarking ML Potentials for Hydrocarbon Isomerization Energies

This benchmark evaluates the accuracy of a range of machine learning potentials included in AMS, using the ISOC7 dataset from the paper: A. Karton and E. Semidalas, “Benchmarking Isomerization Energies for C5–C7 Hydrocarbons: The ISOC7 Database,” J. Comput. Chem. 47(2) (2026): e70296, https://doi.org/10.1002/jcc.70296.

The ISOC7 dataset contains a diverse range of C5–C7 hydrocarbons, from simple alkanes, alkenes, alkynes, and cyclic compounds to highly strained systems featuring tetrahedrane units. Its reference data is at the CCSD(T)/CBS level of accuracy, making it a good benchmark for evaluating the accuracy of ML potentials for small organic systems.

Extract Reference ISOC7 Database

We first extract the ISOC7 database:

import os
import tarfile
from pathlib import Path

isoc7_tar = Path(os.environ["AMSHOME"]) / "atomicdata" / "Benchmarks" / "ISOC7.tar.gz"
isoc7 = Path("./ISOC7")

if not isoc7.exists():
    with tarfile.open(isoc7_tar, "r:gz") as tar:
        tar.extractall(path=Path.cwd())
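<imports></imports>

On Python 3.12+, `tarfile` can additionally reject unsafe archive members (e.g. path traversal) during extraction via the `filter="data"` argument. A minimal, self-contained sketch of that pattern, using a tiny in-memory archive rather than the real ISOC7 tarball:

```python
import io
import os
import tarfile
import tempfile

# Build a tiny tar.gz in memory to stand in for the real archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    data = b"3\n\nC 0 0 0\nH 0 0 1\nH 0 1 0\n"
    info = tarfile.TarInfo(name="demo/mol.xyz")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

dest = tempfile.mkdtemp()
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    # filter="data" (Python 3.12+) rejects absolute paths, "..", etc.;
    # fall back to plain extractall on older interpreters.
    if hasattr(tarfile, "data_filter"):
        tar.extractall(path=dest, filter="data")
    else:
        tar.extractall(path=dest)

print(os.path.exists(os.path.join(dest, "demo", "mol.xyz")))  # True
```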

Load XYZ Data

Then we load all the XYZ files as PLAMS molecules:

import scm.plams as plams

molecules = plams.read_molecules(isoc7, formats=["xyz"])

We can see the number of loaded structures:

print(f"Number of loaded structures: {len(molecules)}")
Number of loaded structures: 1308
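<imports></imports>

Since each key encodes the stoichiometry (e.g. `C5H6_1`), the loaded structures can be tallied per formula with a one-line `Counter`. A minimal sketch, using a hypothetical handful of keys rather than the full `molecules` dictionary:

```python
from collections import Counter

# Hypothetical subset of ISOC7 keys, formatted as "<formula>_<index>".
keys = ["C5H6_1", "C5H6_20", "C6H6_2", "C7H8_3", "C7H8_320"]

# Strip the trailing "_<index>" to group by stoichiometry.
counts = Counter(k.rsplit("_", 1)[0] for k in keys)
print(counts)  # Counter({'C5H6': 2, 'C7H8': 2, 'C6H6': 1})
```

In the notebook the same expression applied to `molecules.keys()` would give the per-formula breakdown of all 1308 structures.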

And view a selection of them, with their corresponding unique id:

imgs = {
    k: plams.view(v, width=300, height=300, guess_bonds=True)
    for k, v in list(molecules.items())[:4]
}

plams.plot_image_grid(imgs, rows=1);
../_images/benchmark_mlpot_isoc7_11_0_0339f147.png

Load Reference Data

Next we load the reference CCSD(T)/CBS energies from the benchmark:

import pandas as pd

reference_data = pd.read_csv(
    isoc7 / "ISOC7_W1-F12_Ref_Values.csv",
    usecols=[
        "reaction_name",
        "species1",
        "species2",
        "W1-F12_isomerization_energy (CCSD(T)/CBS, kcal/mol)",
    ],
)

reference_data
reaction_name species1 species2 W1-F12_isomerization_energy (CCSD(T)/CBS, kcal/mol)
0 ISO_1 C5H6_20 C5H6_1 91.439
1 ISO_2 C5H6_20 C5H6_2 90.146
2 ISO_3 C5H6_20 C5H6_3 69.361
3 ISO_4 C5H6_20 C5H6_4 61.970
4 ISO_5 C5H6_20 C5H6_5 56.739
... ... ... ... ...
1292 ISO_1293 C7H14_60 C7H14_55 8.228
1293 ISO_1294 C7H14_60 C7H14_56 6.718
1294 ISO_1295 C7H14_60 C7H14_57 5.015
1295 ISO_1296 C7H14_60 C7H14_58 4.639
1296 ISO_1297 C7H14_60 C7H14_59 4.495

1297 rows × 4 columns
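<imports></imports>

Before wiring the reference energies into ParAMS, a couple of quick pandas sanity checks can catch malformed rows. A sketch on a hypothetical three-row slice mirroring the CSV columns (values taken from the table above; the short column name `ref_kcal_mol` is ours):

```python
import pandas as pd

# Hypothetical slice mirroring the ISOC7 reference CSV.
df = pd.DataFrame(
    {
        "reaction_name": ["ISO_1", "ISO_2", "ISO_3"],
        "species1": ["C5H6_20"] * 3,
        "species2": ["C5H6_1", "C5H6_2", "C5H6_3"],
        "ref_kcal_mol": [91.439, 90.146, 69.361],
    }
)

# Each reaction should relate two distinct species with a finite energy.
assert (df["species1"] != df["species2"]).all()
assert df["ref_kcal_mol"].notna().all()
print(df["ref_kcal_mol"].agg(["min", "max"]).to_dict())  # {'min': 69.361, 'max': 91.439}
```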

Set up Data Sets with ParAMS

To run our benchmark, we are going to use ParAMS. First, we set up the JobCollection, which contains the information on the input structures.

import scm.params as params

results_importer = params.ResultsImporter(settings={"units": {"energy": "kcal/mol"}})
for k, v in molecules.items():
    jce = params.JCEntry()
    jce.settings.input.AMS.Task = "SinglePoint"
    jce.molecule = v
    jce.metadata["Source"] = "https://doi.org/10.1002/jcc.70296"
    results_importer.job_collection.add_entry(k, jce)
print(results_importer.job_collection.from_jobids(["C5H10_1"]))
---
dtype: JobCollection
version: '2026.101'
---
ID: 'C5H10_1'
ReferenceEngineID: None
AMSInput: |
   Task SinglePoint
   System
     Atoms
                 C      -2.0912640000       0.3020780000       0.0337330000
                 H      -1.8884470000       1.3092600000      -0.3363480000
                 H      -2.9862710000      -0.0691500000      -0.4691390000
                 H      -2.3168350000       0.3813300000       1.0998540000
                 C      -0.8950030000      -0.6171310000      -0.2050890000
                 H      -1.1307500000      -1.6274460000       0.1449640000
                 H      -0.7058150000      -0.7029420000      -1.2800580000
                 C       0.3616680000      -0.1305570000       0.4747630000
                 H       0.2826960000      -0.0502440000       1.5542310000
                 C       1.7140390000      -0.5126740000      -0.0538700000
                 H       1.7487570000      -1.1235900000      -0.9469660000
                 H       2.5092940000      -0.7118750000       0.6517470000
                 C       1.2230000000       0.9054910000      -0.1923400000
                 H       1.6860140000       1.6700050000       0.4167400000
                 H       0.9267080000       1.2414110000      -1.1782100000
     End
   End
Source: https://doi.org/10.1002/jcc.70296
...

Next we set up the dataset containing the information on each isomerization reaction and the reference energy:

ha_to_kcal_mol = plams.Units.conversion_ratio("Ha", "kcal/mol")

for _, (s1, s2, ref_e) in reference_data[
    ["species1", "species2", "W1-F12_isomerization_energy (CCSD(T)/CBS, kcal/mol)"]
].iterrows():
    results_importer.data_set.add_entry(
        f"energy('{s2}')-energy('{s1}')",
        reference=ref_e,
        unit=("kcal/mol", ha_to_kcal_mol),
        metadata={
            "Source": "https://doi.org/10.1002/jcc.70296",
            "reference_method": "CCSD(T)/CBS",
        },
    )
print(results_importer.data_set.from_jobids(["C5H6_1", "C5H6_20"]))
---
dtype: DataSet
version: '2026.101'
---
Expression: energy('C5H6_1')-energy('C5H6_20')
Weight: 1.0
Sigma: 0.002
ReferenceValue: 91.439
Unit: kcal/mol, 627.5094737775374
Source: https://doi.org/10.1002/jcc.70296
reference_method: CCSD(T)/CBS
...
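<imports></imports>

The `Unit` line above shows the Hartree-to-kcal/mol factor that PLAMS supplies (627.5094737775374). As a sanity check, the same number follows directly from CODATA constants and the thermochemical calorie; the constant values below are assumptions of this sketch, not taken from PLAMS:

```python
# Hartree -> kcal/mol from first principles (CODATA 2018 values assumed).
E_H = 4.3597447222071e-18  # Hartree energy, J
N_A = 6.02214076e23        # Avogadro constant, 1/mol
KCAL = 4184.0              # thermochemical kcal, J

ha_to_kcal_mol = E_H * N_A / KCAL
print(round(ha_to_kcal_mol, 4))  # 627.5095
```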

We can also visualize one of these isomerization reactions, for example the first one with a high isomerization energy:

imgs = {
    k: plams.view(molecules[k], width=300, height=300, guess_bonds=True)
    for k in ["C5H6_20", "C5H6_1"]
}

plams.plot_image_grid(imgs, rows=1);
../_images/benchmark_mlpot_isoc7_24_0_e286b7fd.png

Now we store the yaml files needed to run the ParAMS calculations:

results_importer.store("./isoc7_benchmark")
['./isoc7_benchmark/job_collection.yaml',
 './isoc7_benchmark/results_importer_settings.yaml',
 './isoc7_benchmark/training_set.yaml']

Note that the dataset can also be viewed using the ParAMS GUI:

# !params -gui ./isoc7_benchmark/job_collection.yaml

Run Benchmarks with ParAMS

Now we set up a benchmark job for each ML model of interest, and run them. We include all models suitable for small organic molecules, as well as some trained on materials/other domains for comparison.

models = [
    "AIMNet2-B973c",
    "AIMNet2-wB97MD3",
    "ANI-1ccx",
    "ANI-1x",
    "ANI-2x",
    "eSEN-S-Con-OMol",
    "M3GNet-UP-2022",
    "MACE-MPA-0",
    "UMA-S-1.2-ODAC",
    "UMA-S-1.2-OMat",
    "UMA-S-1.2-OMol",
]
jobs = {}

for model in models:
    job = params.ParAMSJob.from_yaml("./isoc7_benchmark", name=f"benchmark_{model}")
    job.settings.runscript.nproc = 1
    job.settings.runscript.preamble_lines = ["export OMP_NUM_THREADS=1"]
    job.settings.input.parallellevels.jobs = 1
    job.settings.input.mlpotential.model = model
    job.settings.input.task = "SinglePoint"
    jobs[model] = job
for job in jobs.values():
    job.run()
    job.ok()
[09.04|11:58:22] JOB benchmark_AIMNet2-B973c STARTED
[09.04|11:58:22] JOB benchmark_AIMNet2-B973c RUNNING
[09.04|11:58:51] JOB benchmark_AIMNet2-B973c FINISHED
[09.04|11:58:51] JOB benchmark_AIMNet2-B973c SUCCESSFUL
[09.04|11:58:51] JOB benchmark_AIMNet2-wB97MD3 STARTED
[09.04|11:58:51] JOB benchmark_AIMNet2-wB97MD3 RUNNING
[09.04|11:59:19] JOB benchmark_AIMNet2-wB97MD3 FINISHED
[09.04|11:59:19] JOB benchmark_AIMNet2-wB97MD3 SUCCESSFUL
[09.04|11:59:19] JOB benchmark_ANI-1ccx STARTED
[09.04|11:59:19] JOB benchmark_ANI-1ccx RUNNING
[09.04|11:59:32] JOB benchmark_ANI-1ccx FINISHED
[09.04|11:59:32] JOB benchmark_ANI-1ccx SUCCESSFUL
[09.04|11:59:32] JOB benchmark_ANI-1x STARTED
[09.04|11:59:32] JOB benchmark_ANI-1x RUNNING
[09.04|11:59:45] JOB benchmark_ANI-1x FINISHED
[09.04|11:59:45] JOB benchmark_ANI-1x SUCCESSFUL
[09.04|11:59:45] JOB benchmark_ANI-2x STARTED
... output trimmed ....
[09.04|12:02:20] JOB benchmark_M3GNet-UP-2022 SUCCESSFUL
[09.04|12:02:20] JOB benchmark_MACE-MPA-0 STARTED
[09.04|12:02:20] JOB benchmark_MACE-MPA-0 RUNNING
[09.04|12:03:42] JOB benchmark_MACE-MPA-0 FINISHED
[09.04|12:03:42] JOB benchmark_MACE-MPA-0 SUCCESSFUL
[09.04|12:03:42] JOB benchmark_UMA-S-1.2-ODAC STARTED
[09.04|12:03:42] JOB benchmark_UMA-S-1.2-ODAC RUNNING
[09.04|12:06:45] JOB benchmark_UMA-S-1.2-ODAC FINISHED
[09.04|12:06:45] JOB benchmark_UMA-S-1.2-ODAC SUCCESSFUL
[09.04|12:06:45] JOB benchmark_UMA-S-1.2-OMat STARTED
[09.04|12:06:45] JOB benchmark_UMA-S-1.2-OMat RUNNING
[09.04|12:09:40] JOB benchmark_UMA-S-1.2-OMat FINISHED
[09.04|12:09:40] JOB benchmark_UMA-S-1.2-OMat SUCCESSFUL
[09.04|12:09:40] JOB benchmark_UMA-S-1.2-OMol STARTED
[09.04|12:09:40] JOB benchmark_UMA-S-1.2-OMol RUNNING
[09.04|12:12:46] JOB benchmark_UMA-S-1.2-OMol FINISHED
[09.04|12:12:46] JOB benchmark_UMA-S-1.2-OMol SUCCESSFUL

Each completed job now has results for the errors between the model calculations and the reference. For example, with AIMNet2-wB97MD3:

print("\n".join(str(jobs["AIMNet2-wB97MD3"].results.get_data_set_evaluator()).split("\n")[:20]))
Group/Expression            N        MAE          RMSE*        Unit*          Weight     Loss*        Contribution[%]
---------------------------------------------------------------------------------------------------------------------
Total                       1297     1.28848      +1.60312     kcal/mol       1297.000   833315383.808 100.00     Total

   energy                   1297     1.28848      +1.60312     kcal/mol       1297.000   833315383.808 100.00     Extractor
      C7H8_51-C7H8_320      1        7.03500      -7.03500     kcal/mol       1.000      12372796.117 1.48       Expression
      C7H8_3-C7H8_320       1        6.84835      -6.84835     kcal/mol       1.000      11724976.294 1.41       Expression
      C6H6_2-C6H6_56        1        6.80974      -6.80974     kcal/mol       1.000      11593141.526 1.39       Expression
      C7H8_117-C7H8_320     1        6.56349      -6.56349     kcal/mol       1.000      10769839.645 1.29       Expression
      C7H8_79-C7H8_320      1        6.54360      -6.54360     kcal/mol       1.000      10704671.280 1.28       Expression
      C6H6_21-C6H6_56       1        6.35571      -6.35571     kcal/mol       1.000      10098752.565 1.21       Expression
      C7H8_10-C7H8_320      1        6.00151      -6.00151     kcal/mol       1.000      9004523.140  1.08       Expression
      C7H8_256-C7H8_320     1        5.79461      -5.79461     kcal/mol       1.000      8394388.810  1.01       Expression
      C7H8_9-C7H8_320       1        5.52882      -5.52882     kcal/mol       1.000      7641966.352  0.92       Expression
      C7H8_66-C7H8_320      1        5.14151      -5.14151     kcal/mol       1.000      6608776.360  0.79       Expression
      C7H8_4-C7H8_320       1        5.13745      -5.13745     kcal/mol       1.000      6598337.286  0.79       Expression
      C7H8_5-C7H8_320       1        4.50864      -4.50864     kcal/mol       1.000      5081948.943  0.61       Expression
      C7H8_44-C7H8_320      1        4.41894      -4.41894     kcal/mol       1.000      4881765.891  0.59       Expression
      C7H8_266-C7H8_320     1        4.04698      -4.04698     kcal/mol       1.000      4094507.700  0.49       Expression
      C7H10_35-C7H10_396    1        3.99482      +3.99482     kcal/mol       1.000      3989653.283  0.48       Expression
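
The MAE and RMSE columns above are the standard definitions over the per-reaction residuals. A minimal sketch with a few hypothetical residuals (model minus reference, in kcal/mol):

```python
import math

# Hypothetical per-reaction errors, kcal/mol.
errors = [7.035, -6.848, 1.2, -0.4, 0.1]

# Mean absolute error and root mean squared error.
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(round(mae, 4), round(rmse, 3))
```

RMSE weights large outliers more heavily than MAE, which is why the two diverge for models with a few badly predicted reactions.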

Results can alternatively be displayed in a correlation plot of the reference vs. predicted energies:

ax = jobs["AIMNet2-wB97MD3"].results.plot_simple_correlation("energy", validation_set=False)
ax;
../_images/benchmark_mlpot_isoc7_37_0_3914d105.png

Note that all these can also again be viewed in the ParAMS GUI:

# !params -gui ./plams_workdir/benchmark_AIMNet2-wB97MD3/results

Evaluating Benchmark Results

To evaluate the results, we extract the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) from the job results:

results = pd.DataFrame({"Model": models})
results["MAE"] = [jobs[m].results.get_data_set_evaluator().results.mae for m in models]
results["RMSE"] = [jobs[m].results.get_data_set_evaluator().results.rmse for m in models]
results.sort_values(by="MAE", inplace=True)
results.rename(
    columns={
        "MAE": "MAE<sup>*</sup>",
        "RMSE": "RMSE<sup>*</sup>",
    },
    inplace=True,
)
results.style.hide(axis="index").background_gradient(
    subset=["MAE<sup>*</sup>", "RMSE<sup>*</sup>"], cmap="Blues", vmin=0, vmax=10
).format("{:.2f}", subset=["MAE<sup>*</sup>", "RMSE<sup>*</sup>"]).set_caption(
    "<br><span style='text-align:left;font-size:18px'>Benchmark Results for <a href='https://doi.org/10.1002/jcc.70296' target='_blank'>ISOC7 Dataset</a> Isomerization Energies</span>"
    "<br><span style='font-size:14px'><sup>*</sup> Reference method CCSD(T)/CBS, in kcal/mol</span>"
).set_table_styles(
    [
        {"selector": "th", "props": [("border-bottom", "2px solid black")]},
        {"selector": "th, td", "props": [("min-width", "100px")]},
        {"selector": "td", "props": [("border-bottom", "1px solid lightgray")]},
        {"selector": "table", "props": [("border-collapse", "collapse")]},
    ]
)

Benchmark Results for ISOC7 Dataset Isomerization Energies
* Reference method CCSD(T)/CBS, in kcal/mol
Model MAE* RMSE*
AIMNet2-wB97MD3 1.29 1.60
eSEN-S-Con-OMol 2.49 2.89
UMA-S-1.2-OMat 2.88 3.62
ANI-1ccx 3.08 5.17
AIMNet2-B973c 4.16 4.96
UMA-S-1.2-ODAC 4.92 6.23
ANI-1x 5.15 6.65
ANI-2x 5.75 7.29
M3GNet-UP-2022 7.82 10.09
UMA-S-1.2-OMol 8.46 10.54
MACE-MPA-0 13.46 15.47

From these results, we see that AIMNet2-wB97MD3 performs best, achieving the lowest errors (MAE = 1.29 kcal/mol, RMSE = 1.60 kcal/mol), clearly outperforming all other models. In a broader context, this is on a par with the best performing DFT functionals, at a fraction of the computational cost.

Other models targeted at small molecules also perform well, such as eSEN-S-Con-OMol and ANI-1ccx, both with MAE < 5 kcal/mol.

We find that, as expected, the worst performers in this case are models trained primarily on materials data such as M3GNet-UP-2022 and MACE-MPA-0, which are better suited to other chemical domains.

One surprising result is for UMA-S-1.2-OMol, which would be expected to perform better for these systems than other UMA variants. We can compare and contrast the correlation plots for UMA-S-1.2-OMol with the previous plot for the best performing model, AIMNet2-wB97MD3:

ax = jobs["UMA-S-1.2-OMol"].results.plot_simple_correlation("energy", validation_set=False)
ax;
../_images/benchmark_mlpot_isoc7_48_0_9fd1c507.png

This shows that some structures have a very large energy deviation from the reference. We can inspect these structures:

print("\n".join(str(jobs["UMA-S-1.2-OMol"].results.get_data_set_evaluator()).split("\n")[:9]))
Group/Expression            N        MAE          RMSE*        Unit*          Weight     Loss*        Contribution[%]
---------------------------------------------------------------------------------------------------------------------
Total                       1297     8.45743      +10.54307    kcal/mol       1297.000   36042430408.064 100.00     Total

   energy                   1297     8.45743      +10.54307    kcal/mol       1297.000   36042430408.064 100.00     Extractor
      C6H6_2-C6H6_56        1        51.50648     -51.50648    kcal/mol       1.000      663229344.845 1.84       Expression
      C7H8_3-C7H8_320       1        43.75316     -43.75316    kcal/mol       1.000      478584670.414 1.33       Expression
      C7H8_5-C7H8_320       1        41.02305     -41.02305    kcal/mol       1.000      420722696.516 1.17       Expression
      C6H6_21-C6H6_56       1        40.44702     -40.44702    kcal/mol       1.000      408990409.659 1.13       Expression
imgs = {
    k: plams.view(molecules[k], width=300, height=300, guess_bonds=True)
    for k in ["C6H6_2", "C7H8_3", "C7H8_5", "C6H6_21"]
}

plams.plot_image_grid(imgs, rows=1);
../_images/benchmark_mlpot_isoc7_51_0_ab28216f.png

As noted in the original paper, we find that many of the systems with high deviations from the reference values contain a tetrahedrane core, which is also a common issue for many of the tested DFT functionals.

Python Script

#!/usr/bin/env python
# coding: utf-8

# ## Benchmarking Hydrocarbon Isomerization Energies

# This benchmark evaluates the accuracy of a range of machine learning potentials included in AMS, using the ISOC7 dataset from the paper: A. Karton and E. Semidalas, “Benchmarking Isomerization Energies for C5–C7 Hydrocarbons: The ISOC7 Database,” J. Comput. Chem. 47(2) (2026): e70296, https://doi.org/10.1002/jcc.70296.
#
# The ISOC7 dataset contains a diverse range of C5–C7 hydrocarbons, from simple alkanes, alkenes, alkynes, and cyclic compounds to highly strained systems featuring tetrahedrane units. Its reference data is at the CCSD(T)/CBS level of accuracy, making it a good benchmark for evaluating the accuracy of ML potentials for small organic systems.

# ### Extract Reference ISOC7 Database

# We first extract the ISOC7 database:

import os
import tarfile
from pathlib import Path

isoc7_tar = Path(os.environ["AMSHOME"]) / "atomicdata" / "Benchmarks" / "ISOC7.tar.gz"
isoc7 = Path("./ISOC7")

if not isoc7.exists():
    with tarfile.open(isoc7_tar, "r:gz") as tar:
        tar.extractall(path=Path.cwd())


# ### Load XYZ Data

# Then we load all the xyz files as PLAMS molecules:

import scm.plams as plams

molecules = plams.read_molecules(isoc7, formats=["xyz"])


# We can see the number of loaded structures:

print(f"Number of loaded structures: {len(molecules)}")


# And view a selection of them, with their corresponding unique id:

imgs = {k: plams.view(v, width=300, height=300, guess_bonds=True) for k, v in list(molecules.items())[:4]}

plams.plot_image_grid(imgs, rows=1, save_path="picture1.png")
# ### Load Reference Data

# Next we load the reference CCSD(T)/CBS energies from the benchmark:

import pandas as pd

reference_data = pd.read_csv(
    isoc7 / "ISOC7_W1-F12_Ref_Values.csv",
    usecols=[
        "reaction_name",
        "species1",
        "species2",
        "W1-F12_isomerization_energy (CCSD(T)/CBS, kcal/mol)",
    ],
)

reference_data


# ### Set up Data Sets with ParAMS

# To run our benchmark, we are going to use ParAMS. First, we set up the JobCollection, which contains the information on the input structures.

import scm.params as params

results_importer = params.ResultsImporter(settings={"units": {"energy": "kcal/mol"}})


for k, v in molecules.items():
    jce = params.JCEntry()
    jce.settings.input.AMS.Task = "SinglePoint"
    jce.molecule = v
    jce.metadata["Source"] = "https://doi.org/10.1002/jcc.70296"
    results_importer.job_collection.add_entry(k, jce)


print(results_importer.job_collection.from_jobids(["C5H10_1"]))


# Next we set up the dataset containing the information on each isomerization reaction and the reference energy:

ha_to_kcal_mol = plams.Units.conversion_ratio("Ha", "kcal/mol")

for _, (s1, s2, ref_e) in reference_data[
    ["species1", "species2", "W1-F12_isomerization_energy (CCSD(T)/CBS, kcal/mol)"]
].iterrows():
    results_importer.data_set.add_entry(
        f"energy('{s2}')-energy('{s1}')",
        reference=ref_e,
        unit=("kcal/mol", ha_to_kcal_mol),
        metadata={
            "Source": "https://doi.org/10.1002/jcc.70296",
            "reference_method": "CCSD(T)/CBS",
        },
    )


print(results_importer.data_set.from_jobids(["C5H6_1", "C5H6_20"]))


# We can also visualize one of these isomerization reactions, for example the first one with a high isomerization energy:

imgs = {k: plams.view(molecules[k], width=300, height=300, guess_bonds=True) for k in ["C5H6_20", "C5H6_1"]}

plams.plot_image_grid(imgs, rows=1, save_path="picture2.png")
# Now we store the yaml files needed to run the ParAMS calculations:

results_importer.store("./isoc7_benchmark")


# Note that the dataset can also be viewed using the ParAMS GUI:

# !params -gui ./isoc7_benchmark/job_collection.yaml


# ### Run Benchmarks with ParAMS

# Now we set up a benchmark job for each ML model of interest, and run them. We include all models suitable for small organic molecules, as well as some trained on materials/other domains for comparison.

models = [
    "AIMNet2-B973c",
    "AIMNet2-wB97MD3",
    "ANI-1ccx",
    "ANI-1x",
    "ANI-2x",
    "eSEN-S-Con-OMol",
    "M3GNet-UP-2022",
    "MACE-MPA-0",
    "UMA-S-1.2-ODAC",
    "UMA-S-1.2-OMat",
    "UMA-S-1.2-OMol",
]


jobs = {}

for model in models:
    job = params.ParAMSJob.from_yaml("./isoc7_benchmark", name=f"benchmark_{model}")
    job.settings.runscript.nproc = 1
    job.settings.runscript.preamble_lines = ["export OMP_NUM_THREADS=1"]
    job.settings.input.parallellevels.jobs = 1
    job.settings.input.mlpotential.model = model
    job.settings.input.task = "SinglePoint"
    jobs[model] = job


for job in jobs.values():
    job.run()
    job.ok()


# Each completed job now has results for the errors between the model calculations and the reference. For example, with AIMNet2-wB97MD3:

print("\n".join(str(jobs["AIMNet2-wB97MD3"].results.get_data_set_evaluator()).split("\n")[:20]))


# Results can alternatively be displayed in a correlation plot of the reference vs. predicted energies:

ax = jobs["AIMNet2-wB97MD3"].results.plot_simple_correlation("energy", validation_set=False)
ax
ax.figure.savefig("picture3.png")


# Note that all these can also again be viewed in the ParAMS GUI:

# !params -gui ./plams_workdir/benchmark_AIMNet2-wB97MD3/results


# ## Evaluating Benchmark Results

# To evaluate the results, we extract the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) from the job results:

results = pd.DataFrame({"Model": models})


results["MAE"] = [jobs[m].results.get_data_set_evaluator().results.mae for m in models]
results["RMSE"] = [jobs[m].results.get_data_set_evaluator().results.rmse for m in models]


results.sort_values(by="MAE", inplace=True)


results.rename(
    columns={
        "MAE": "MAE<sup>*</sup>",
        "RMSE": "RMSE<sup>*</sup>",
    },
    inplace=True,
)


results.style.hide(axis="index").background_gradient(
    subset=["MAE<sup>*</sup>", "RMSE<sup>*</sup>"], cmap="Blues", vmin=0, vmax=10
).format("{:.2f}", subset=["MAE<sup>*</sup>", "RMSE<sup>*</sup>"]).set_caption(
    "<br><span style='text-align:left;font-size:18px'>Benchmark Results for <a href='https://doi.org/10.1002/jcc.70296' target='_blank'>ISOC7 Dataset</a> Isomerization Energies</span>"
    "<br><span style='font-size:14px'><sup>*</sup> Reference method CCSD(T)/CBS, in kcal/mol</span>"
).set_table_styles(
    [
        {"selector": "th", "props": [("border-bottom", "2px solid black")]},
        {"selector": "th, td", "props": [("min-width", "100px")]},
        {"selector": "td", "props": [("border-bottom", "1px solid lightgray")]},
        {"selector": "table", "props": [("border-collapse", "collapse")]},
    ]
)


# From these results, we see that `AIMNet2-wB97MD3` performs best, achieving the lowest errors (MAE = 1.29 kcal/mol, RMSE = 1.60 kcal/mol), clearly outperforming all other models. In a broader context, this is on a par with the best performing DFT functionals, at a fraction of the computational cost.
#
# Other models targeted at small molecules also perform well, such as `eSEN-S-Con-OMol` and `ANI-1ccx`, both with MAE < 5 kcal/mol.
#
# We find that, as expected, the worst performers in this case are models trained primarily on materials data such as `M3GNet-UP-2022` and `MACE-MPA-0`, which are better suited to other chemical domains.
#
# One surprising result is for `UMA-S-1.2-OMol`, which would be expected to perform better for these systems than other UMA variants. We can compare and contrast the correlation plots for `UMA-S-1.2-OMol` with the previous plot for the best performing model, `AIMNet2-wB97MD3`:

ax = jobs["UMA-S-1.2-OMol"].results.plot_simple_correlation("energy", validation_set=False)
ax
ax.figure.savefig("picture4.png")


# This shows that some structures have a very large energy deviation from the reference. We can inspect these structures:

print("\n".join(str(jobs["UMA-S-1.2-OMol"].results.get_data_set_evaluator()).split("\n")[:9]))


imgs = {
    k: plams.view(molecules[k], width=300, height=300, guess_bonds=True)
    for k in ["C6H6_2", "C7H8_3", "C7H8_5", "C6H6_21"]
}

plams.plot_image_grid(imgs, rows=1, save_path="picture5.png")


# As noted in the original paper, we find that many of the systems with high deviations from the reference values contain a tetrahedrane core, which is also a common issue for many of the tested DFT functionals.