Active Learning: Liquid Water Properties

Train an active-learning M3GNet model for liquid water and validate it against diffusion, radial distribution functions, and density from molecular-dynamics simulations.

When running active learning it’s usually a good idea to start off “simple” and make the system/structures gradually more complicated.

To train liquid water, we here

  • first train a potential at slightly above room temperature and 1.0 g/cm^3

  • continue with a second active learning loop where the density is explicitly scanned from a low to a high value

Initial imports

import scm.plams as plams
from scm.simple_active_learning import SimpleActiveLearningJob
import matplotlib.pyplot as plt
from scm.external_engines.core import interface_is_installed

assert interface_is_installed("m3gnet"), (
    "You must install the m3gnet backend before following this tutorial!"
)
plams.init()
PLAMS working folder: /path/to/plams_workdir.003

Create an initial water box

water = plams.from_smiles("O")
for at in water:
    at.properties = plams.Settings()
plams.plot_molecule(water)
image generated from notebook
box = plams.packmol(water, n_molecules=48, density=1.0)
plams.plot_molecule(box, rotation="-5x,5y,0z")
image generated from notebook

Let’s run a short MD simulation with M3GNet-UP-2022 to make the structure more realistic:

up_s = plams.Settings()
up_s.input.MLPotential.Model = "M3GNet-UP-2022"
up_s.runscript.nproc = 1  # always run AMS Driver in serial for MLPotential
up_md = plams.AMSNVTJob(
    molecule=box,
    settings=up_s,
    name="up_md",
    nsteps=1000,
    timestep=0.5,
    temperature=350,
)
up_md.run();
[22.02|09:51:37] JOB up_md STARTED
[22.02|09:51:37] JOB up_md RUNNING
[22.02|09:52:19] JOB up_md FINISHED
[22.02|09:52:20] JOB up_md SUCCESSFUL





<scm.plams.interfaces.adfsuite.ams.AMSResults at 0x7fccbdae5a90>

New structure:

starting_structure = up_md.results.get_main_molecule()
plams.plot_molecule(starting_structure, rotation="-5x,5y,0z")
image generated from notebook

Simple Active Learning setup

Reference engine settings

Here, we choose to train against ReaxFF Water-2017.ff, which gives a good water structure.

fast_ref_s = plams.Settings()
fast_ref_s.input.ReaxFF.ForceField = "Water2017.ff"
fast_ref_s.runscript.nproc = 1

slow_ref_s = plams.Settings()
slow_ref_s.input.QuantumESPRESSO.Pseudopotentials.Family = "SSSP-Efficiency"
slow_ref_s.input.QuantumESPRESSO.Pseudopotentials.Functional = "PBE"
slow_ref_s.input.QuantumESPRESSO.K_Points._h = "gamma"
slow_ref_s.input.QuantumESPRESSO.System = plams.Settings(
    input_dft="revpbe", ecutwfc=40, vdw_corr="Grimme-D3", dftd3_version=4
)

Change to slow_ref_s to train to revPBE-D3(BJ) instead:

ref_s = fast_ref_s.copy()

NVT molecular dynamics settings

nvt_md_s = plams.AMSNVTJob(
    nsteps=20000,
    timestep=0.5,
    temperature=(270, 350, 350),
    tau=100,
    thermostat="Berendsen",
).settings

ParAMS machine learning settings

ml_s = plams.Settings()
ml_s.input.ams.MachineLearning.Backend = "M3GNet"
ml_s.input.ams.MachineLearning.CommitteeSize = 1
ml_s.input.ams.MachineLearning.M3GNet.Model = "UniversalPotential"
ml_s.input.ams.MachineLearning.MaxEpochs = 200

Active learning settings

Liquid water is a “simple” homogeneous system, so we can expect the ML method to perform quite well. We therefore decrease the success criteria thresholds a bit compared to the default vvalues, to ensure that we get accurate results.

Since we will immediately continue with another active learning loop, we disable the “RerunSimulation” as we are not interested in the MD simulation per se.

al_s = plams.Settings()
al_s.input.ams.ActiveLearning.Steps.Type = "Geometric"
al_s.input.ams.ActiveLearning.Steps.Geometric.Start = 10
al_s.input.ams.ActiveLearning.Steps.Geometric.NumSteps = 8
al_s.input.ams.ActiveLearning.InitialReferenceData.Generate.M3GNetShortMD.Enabled = (
    "Yes"
)
al_s.input.ams.ActiveLearning.SuccessCriteria.Energy.Relative = 0.003
al_s.input.ams.ActiveLearning.SuccessCriteria.Forces.MaxDeviationForZeroForce = 0.35
al_s.input.ams.ActiveLearning.AtEnd.RerunSimulation = "No"

Complete job

settings = ref_s + nvt_md_s + ml_s + al_s
job = SimpleActiveLearningJob(
    settings=settings, molecule=starting_structure, name="sal"
)
print(job.get_input())
ActiveLearning
  AtEnd
    RerunSimulation False
  End
  InitialReferenceData
... output trimmed ....
         0.0000000000    11.2817834662     0.0000000000
         0.0000000000     0.0000000000    11.2817834662
  End
End

Run the simple active learning job

job.run(watch=True);
[22.02|10:12:07] JOB sal STARTED
[22.02|10:12:07] Renaming job sal to sal.002
[22.02|10:12:07] JOB sal.002 RUNNING
[22.02|10:12:09] Simple Active Learning 2024.101,  Nodes: 1, Procs: 1
[22.02|10:12:11] Composition of main system: H96O48
... output trimmed ....

[22.02|10:36:15] Active learning finished!
[22.02|10:36:15] Goodbye!
[22.02|10:36:15] JOB sal.002 FINISHED
[22.02|10:36:15] JOB sal.002 SUCCESSFUL





<scm.simple_active_learning.plams.simple_active_learning_job.SimpleActiveLearningResults at 0x7fccbe25c580>

Validate trained model by RDF and MSD

Note: You should skip this part if you trained to DFT since the reference MD calculation will take a very long time!

mol = job.results.get_main_molecule()
plams.plot_molecule(mol, rotation="-5x,5y,0z")
image generated from notebook
retrained_model_settings = (
    job.results.get_params_job().results.get_production_engine_settings()
)
retrained_model_settings.runscript.nproc = 1

Equilibration and production MD settings

eq_md_settings = plams.AMSNVTJob(
    nsteps=8000,
    timestep=0.5,
    thermostat="Berendsen",
    tau=100,
    temperature=300,
    samplingfreq=100,
).settings

prod_md_settings = plams.AMSNVTJob(
    nsteps=50000,
    timestep=0.5,
    thermostat="NHC",
    tau=200,
    temperature=300,
    samplingfreq=100,
).settings

Retrained model equilibration

retrained_model_eq_md_job = plams.AMSJob(
    settings=eq_md_settings + retrained_model_settings,
    molecule=mol,
    name="retrained_model_eq_md_dens_1",
)
retrained_model_eq_md_job.run();
[22.02|13:39:24] JOB eq_md_dens_1 STARTED
[22.02|13:39:24] Renaming job eq_md_dens_1 to eq_md_dens_1.003
[22.02|13:39:24] JOB eq_md_dens_1.003 RUNNING
[22.02|13:42:58] JOB eq_md_dens_1.003 FINISHED
[22.02|13:42:58] JOB eq_md_dens_1.003 SUCCESSFUL

Retrained model production simulation

Let’s then run a production simulation from the final structure of the above equilibration MD using both the retrained model and the reference engine:

retrained_model_prod_md_job = plams.AMSJob(
    settings=prod_md_settings + retrained_model_settings,
    name="retrained_model_prod_md_dens_1",
    molecule=retrained_model_eq_md_job.results.get_main_molecule(),
)
retrained_model_prod_md_job.run();
[22.02|13:46:22] JOB retrained_model_prod_md_dens_1 STARTED
[22.02|13:46:22] Renaming job retrained_model_prod_md_dens_1 to retrained_model_prod_md_dens_1.005
[22.02|13:46:22] JOB retrained_model_prod_md_dens_1.005 RUNNING
[22.02|14:07:30] JOB retrained_model_prod_md_dens_1.005 FINISHED
[22.02|14:07:30] JOB retrained_model_prod_md_dens_1.005 SUCCESSFUL

Reference equilibration MD

reference_eq_md_job = plams.AMSJob(
    settings=eq_md_settings + ref_s,
    molecule=mol,
    name="reference_eq_md_dens_1",
)
reference_eq_md_job.run();
[22.02|14:08:55] JOB reference_eq_md_dens_1 STARTED
[22.02|14:08:55] JOB reference_eq_md_dens_1 RUNNING
[22.02|14:09:19] JOB reference_eq_md_dens_1 FINISHED
[22.02|14:09:19] JOB reference_eq_md_dens_1 SUCCESSFUL

Reference production MD

reference_prod_md_job = plams.AMSJob(
    settings=prod_md_settings + ref_s,
    name="reference_prod_md_dens_1",
    molecule=reference_eq_md_job.results.get_main_molecule(),
)
reference_prod_md_job.run();
[22.02|14:09:19] JOB reference_prod_md_dens_1 STARTED
[22.02|14:09:19] Renaming job reference_prod_md_dens_1 to reference_prod_md_dens_1.003
[22.02|14:09:19] JOB reference_prod_md_dens_1.003 RUNNING
[22.02|14:11:59] JOB reference_prod_md_dens_1.003 FINISHED
[22.02|14:11:59] JOB reference_prod_md_dens_1.003 SUCCESSFUL

Mean squared displacement (MSD) helper functions

For a detailed explanation of the MSD and RDF jobs, see the “Molecular dynamics with Python” tutorial

def get_msd_job(job: plams.AMSJob, symbol: str = "O"):
    atom_indices = [
        i
        for i, at in enumerate(job.results.get_main_molecule(), 1)
        if at.symbol == symbol
    ]

    msd_job = plams.AMSMSDJob(
        job,
        name="msd-" + job.name,
        atom_indices=atom_indices,  # indices start with 1 for the first atom
        max_correlation_time_fs=4000,  # max correlation time must be set before running the job
        start_time_fit_fs=2000,  # start_time_fit can also be changed later in the postanalysis
    )
    msd_job.run()

    return msd_job


def plot_msd(job, start_time_fit_fs=None):
    """job: an AMSMSDJob"""
    time, msd = job.results.get_msd()
    fit_result, fit_x, fit_y = job.results.get_linear_fit(
        start_time_fit_fs=start_time_fit_fs
    )
    # the diffusion coefficient can also be calculated as fit_result.slope/6 (ang^2/fs)
    diffusion_coefficient = job.results.get_diffusion_coefficient(
        start_time_fit_fs=start_time_fit_fs
    )  # m^2/s
    plt.figure(figsize=(5, 3))
    plt.plot(time, msd, label="MSD")
    plt.plot(
        fit_x, fit_y, label="Linear fit slope={:.5f} ang^2/fs".format(fit_result.slope)
    )
    plt.legend()
    plt.xlabel("Correlation time (fs)")
    plt.ylabel("Mean square displacement (ang^2)")
    plt.title("MSD: Diffusion coefficient = {:.2e} m^2/s".format(diffusion_coefficient))

Temporarily turn off PLAMS logging

Technically, the MSD and RDF jobs are normal PLAMS jobs. However, they are very fast to run. We can turn off the PLAMS logging to keep the Jupyter notebook a bit more tidy:

plams.config.log.stdout = 0

Retrained model MSD, diffusion coefficient

retrained_model_msd_job = get_msd_job(retrained_model_prod_md_job, "O")
retrained_model_D = (
    retrained_model_msd_job.results.get_diffusion_coefficient()
)  # diffusion coefficient, m^2/s
plot_msd(retrained_model_msd_job)
image generated from notebook

Reference MSD, diffusion coefficient

reference_msd_job = get_msd_job(reference_prod_md_job, "O")
reference_D = (
    reference_msd_job.results.get_diffusion_coefficient()
)  # diffusion coefficient, m^2/s
plot_msd(reference_msd_job)
image generated from notebook

Conclusion for diffusion coefficient: In this case, the retrained model gives 2.53e-9 m^2/s and the reference method 2.62e-9 m^2/s. That is very good agreement! Note: your results may be somewhat different.

Retrained model and reference RDF

Let’s compare the calculated O-O, O-H, and H-H radial distribution functions (rdf):

def get_rdf(job, atom_indices, atom_indices_to, rmin, rmax, rstep):
    rdf = plams.AMSRDFJob(
        job,
        atom_indices=atom_indices,
        atom_indices_to=atom_indices_to,
        rmin=rmin,
        rmax=rmax,
        rstep=rstep,
    )
    rdf.run()

    return rdf.results.get_rdf()


final_frame = (
    retrained_model_prod_md_job.results.get_main_molecule()
)  # doesn't matter if retrained model or reference
O_ind = [i for i, at in enumerate(final_frame, 1) if at.symbol == "O"]
H_ind = [i for i, at in enumerate(final_frame, 1) if at.symbol == "H"]
rmax = final_frame.lattice[0][0] / 2
rstep = 0.05

O-O rdf

atom_indices, atom_indices_to = O_ind, O_ind
rmin = 2.0
pred_x, pred_y = get_rdf(
    retrained_model_prod_md_job, atom_indices, atom_indices_to, rmin, rmax, rstep
)
ref_x, ref_y = get_rdf(
    reference_prod_md_job, atom_indices, atom_indices_to, rmin, rmax, rstep
)
plt.plot(pred_x, pred_y, label="Retrained model")
plt.plot(ref_x, ref_y, label="Reference")
plt.xlabel("r (angstrom)")
plt.legend()
plt.title("O-O rdf");
image generated from notebook

O-H rdf

atom_indices, atom_indices_to = O_ind, H_ind
rmin = 1.3
pred_x, pred_y = get_rdf(
    retrained_model_prod_md_job, atom_indices, atom_indices_to, rmin, rmax, rstep
)
ref_x, ref_y = get_rdf(
    reference_prod_md_job, atom_indices, atom_indices_to, rmin, rmax, rstep
)
plt.plot(pred_x, pred_y, label="Retrained model")
plt.plot(ref_x, ref_y, label="Reference")
plt.xlabel("r (angstrom)")
plt.legend()
plt.title("O-H rdf");
image generated from notebook

H-H rdf

atom_indices, atom_indices_to = H_ind, H_ind
rmin = 1.3
pred_x, pred_y = get_rdf(
    retrained_model_prod_md_job, atom_indices, atom_indices_to, rmin, rmax, rstep
)
ref_x, ref_y = get_rdf(
    reference_prod_md_job, atom_indices, atom_indices_to, rmin, rmax, rstep
)
plt.plot(pred_x, pred_y, label="Retrained model")
plt.plot(ref_x, ref_y, label="Reference")
plt.xlabel("r (angstrom)")
plt.legend()
plt.title("H-H rdf");
image generated from notebook

Turn PLAMS logging back on

plams.config.log.stdout = 3  # default value

Density and NPT

Check the predicted vs. reference density

npt_md_s = plams.AMSNPTJob(
    nsteps=100000,
    timestep=0.5,
    thermostat="NHC",
    tau=100,
    temperature=300,
    barostat="MTK",
    barostat_tau=1000,
    equal="XYZ",
    pressure=1e5,
).settings
retrained_model_npt_job = plams.AMSJob(
    settings=npt_md_s + retrained_model_settings,
    name="retrained_model_npt",
    molecule=retrained_model_prod_md_job.results.get_main_molecule(),
)

retrained_model_npt_job.run();
[22.02|15:25:22] JOB retrained_model_npt STARTED
[22.02|15:25:22] JOB retrained_model_npt RUNNING
[22.02|16:06:01] JOB retrained_model_npt FINISHED
[22.02|16:06:01] JOB retrained_model_npt SUCCESSFUL





<scm.plams.interfaces.adfsuite.ams.AMSResults at 0x7fcc8c910670>
retrained_model_density = (
    plams.AMSNPTJob.load_external(retrained_model_npt_job.results.rkfpath())
    .results.get_equilibrated_molecule()
    .get_density()
)
print(
    f"Retrained model water density at 300 K: {retrained_model_density * 1e-3:.2f} g/cm^3"
)
Retrained model water density at 300 K: 0.95 g/cm^3
plams.config.jobmanager.hashing = None
reference_npt_job = plams.AMSJob(
    settings=npt_md_s + ref_s,
    name="reference_npt",
    molecule=reference_prod_md_job.results.get_main_molecule(),
)

reference_npt_job.run()
[22.02|18:53:54] JOB reference_npt STARTED
[22.02|18:53:54] Renaming job reference_npt to reference_npt.002
[22.02|18:53:54] JOB reference_npt.002 RUNNING
[22.02|18:58:13] JOB reference_npt.002 FINISHED
[22.02|18:58:13] JOB reference_npt.002 SUCCESSFUL





<scm.plams.interfaces.adfsuite.ams.AMSResults at 0x7fcc8d39d520>
reference_density = (
    plams.AMSNPTJob.load_external(reference_npt_job.results.rkfpath())
    .results.get_equilibrated_molecule()
    .get_density()
)
print(f"Reference model water density at 300 K: {reference_density * 1e-3:.2f} g/cm^3")
Reference model water density at 300 K: 1.01 g/cm^3

The above reference value for ReaxFF Water-2017.ff agrees exactly with the published reference value of 1.01 g/cm^3.

However, the retrained M3GNet model predicts a density of 0.95 g/cm^3. The agreement is reasonable but not excellent. This can be explained by the fact that almost all training data points were at 1.00 g/cm^3. Only a few points (from the “M3GNetShortMD” initial reference data generator) were taken at other densities.

Let’s continue the active learning while sampling more densities. There are two strategies:

  • Use an NPT simulation during the active learning

  • Scan the density during the active learning

Here, we choose the second approach in order to ensure that multiple different densities are sampled.

Initial structure for scanning density

Get the final frame from one of the previous MD simulations, and linearly scale the density to 800 kg/m^3 = 0.8 g/cm^3. This will stretch out the O-H bonds so follow up with a short UFF preoptimization.

new_structure = final_frame.copy()
new_structure.set_density(850)
new_structure = plams.preoptimize(new_structure, model="uff", maxiterations=20)
plams.plot_molecule(new_structure)
image generated from notebook

Second active learning job: scanning density

Here we set Steps.Type = “Linear” to run reference calculations every 5000 MD steps.

To get an accurate density it’s very important that the predicted energy is accurate. It is not enough to just get a good fit for the forces.

Here, we decrease the success criteria for both the energy and forces compared to default values.

nsteps = 80000
scan_density_md_s = plams.AMSMDScanDensityJob(
    molecule=new_structure,
    scan_density_upper=1.15,
    nsteps=nsteps,
    tau=100,
    thermostat="Berendsen",
    temperature=300,
).settings
# we must explicitly set the StopStep, since the AL divides the simulation into multiple segments
scan_density_md_s.input.ams.MolecularDynamics.Deformation.StopStep = nsteps
# job = SimpleActiveLearningJob.load_external(plams.config.default_jobmanager.workdir + "/sal.002")
scan_density_ml_s = ml_s.copy()
scan_density_ml_s.input.ams.MachineLearning.LoadModel = (
    job.results.get_params_results_directory()
)
scan_density_ml_s.input.ams.MachineLearning.Target.Forces.MAE = 0.02
scan_density_ml_s.input.ams.MachineLearning.MaxEpochs = 200
scan_density_al_s = plams.Settings()

scan_density_al_s.input.ams.ActiveLearning.Steps.Type = "Linear"
scan_density_al_s.input.ams.ActiveLearning.Steps.Linear.Start = 500
scan_density_al_s.input.ams.ActiveLearning.Steps.Linear.StepSize = 5000
scan_density_al_s.input.ams.ActiveLearning.InitialReferenceData.Load.FromPreviousModel = "Yes"

scan_density_al_s.input.ams.ActiveLearning.SuccessCriteria.Energy.Relative = 0.001
scan_density_al_s.input.ams.ActiveLearning.SuccessCriteria.Energy.Total = 0.002
# because we do not set Normalization, the above Energy criteria are energies per atom
# scan_density_al_s.input.ams.ActiveLearning.SuccessCriteria.Energy.Normalization =

scan_density_al_s.input.ams.ActiveLearning.SuccessCriteria.Forces.MaxDeviationForZeroForce = 0.30

scan_density_al_s.input.ams.ActiveLearning.AtEnd.RerunSimulation = "No"
scan_density_al_s.input.ams.ActiveLearning.MaxReferenceCalculationsPerAttempt = 2
scan_density_al_job = SimpleActiveLearningJob(
    name="scan_density_al",
    settings=ref_s + scan_density_md_s + scan_density_ml_s + scan_density_al_s,
    molecule=new_structure,
)
scan_density_al_job.run(watch=True);
[23.02|13:01:42] Simple Active Learning 2024.101,  Nodes: 1, Procs: 1
[23.02|13:01:44] Composition of main system: H96O48
[23.02|13:01:44] All REFERENCE calculations will be performed with the following ReaxFF engine:
[23.02|13:01:44]
Engine reaxff
... output trimmed ....



[23.02|14:07:16] Active learning finished!
[23.02|14:07:16] Goodbye!

Let’s recalculate the density again:

new_retrained_model_settings = scan_density_al_job.results.get_params_job().results.get_production_engine_settings()

new_retrained_model_npt_job = plams.AMSJob(
    settings=npt_md_s + new_retrained_model_settings,
    name="new_retrained_model_npt",
    molecule=retrained_model_prod_md_job.results.get_main_molecule(),
)

new_retrained_model_npt_job.run();
new_retrained_model_density = (
    plams.AMSNPTJob.load_external(new_retrained_model_npt_job.results.rkfpath())
    .results.get_equilibrated_molecule()
    .get_density()
)
print(
    f"New retrained model water density at 300 K: {new_retrained_model_density * 1e-3:.2f} g/cm^3"
)
New retrained model water density at 300 K: 1.02 g/cm^3

Conclusion for the density: Using active learning when scanning the densities makes sure that the predicitons are accurate for all densities. Consequently the equilibrium density is also in better agreement with the reference value of 1.01 g/cm^3.

Note that the density in general is quite difficult to fit accurately.

See also

Python Script

#!/usr/bin/env python
# coding: utf-8

# When running active learning it's usually a good idea to start off "simple" and make the system/structures gradually more complicated.
#
# To train liquid water, we here
#
# * first train a potential at slightly above room temperature and 1.0 g/cm^3
#
# * continue with a second active learning loop where the density is explicitly scanned from a low to a high value
#
# ## Initial imports

import scm.plams as plams
from scm.simple_active_learning import SimpleActiveLearningJob
import matplotlib.pyplot as plt
from scm.external_engines.core import interface_is_installed

assert interface_is_installed("m3gnet"), "You must install the m3gnet backend before following this tutorial!"


plams.init()


# ## Create an initial water box

water = plams.from_smiles("O")
for at in water:
    at.properties = plams.Settings()
plams.plot_molecule(water)


box = plams.packmol(water, n_molecules=48, density=1.0)
plams.plot_molecule(box, rotation="-5x,5y,0z")


# Let's run a short MD simulation with M3GNet-UP-2022 to make the structure more realistic:

up_s = plams.Settings()
up_s.input.MLPotential.Model = "M3GNet-UP-2022"
up_s.runscript.nproc = 1  # always run AMS Driver in serial for MLPotential
up_md = plams.AMSNVTJob(
    molecule=box,
    settings=up_s,
    name="up_md",
    nsteps=1000,
    timestep=0.5,
    temperature=350,
)
up_md.run()
# New structure:

starting_structure = up_md.results.get_main_molecule()
plams.plot_molecule(starting_structure, rotation="-5x,5y,0z")


# ## Simple Active Learning setup
#
# ### Reference engine settings
#
# Here, we choose to train against ReaxFF Water-2017.ff, which gives a good water structure.

fast_ref_s = plams.Settings()
fast_ref_s.input.ReaxFF.ForceField = "Water2017.ff"
fast_ref_s.runscript.nproc = 1

slow_ref_s = plams.Settings()
slow_ref_s.input.QuantumESPRESSO.Pseudopotentials.Family = "SSSP-Efficiency"
slow_ref_s.input.QuantumESPRESSO.Pseudopotentials.Functional = "PBE"
slow_ref_s.input.QuantumESPRESSO.K_Points._h = "gamma"
slow_ref_s.input.QuantumESPRESSO.System = plams.Settings(
    input_dft="revpbe", ecutwfc=40, vdw_corr="Grimme-D3", dftd3_version=4
)


# Change to slow_ref_s to train to revPBE-D3(BJ) instead:

ref_s = fast_ref_s.copy()


# ### NVT molecular dynamics settings

nvt_md_s = plams.AMSNVTJob(
    nsteps=20000,
    timestep=0.5,
    temperature=(270, 350, 350),
    tau=100,
    thermostat="Berendsen",
).settings


# ### ParAMS machine learning settings

ml_s = plams.Settings()
ml_s.input.ams.MachineLearning.Backend = "M3GNet"
ml_s.input.ams.MachineLearning.CommitteeSize = 1
ml_s.input.ams.MachineLearning.M3GNet.Model = "UniversalPotential"
ml_s.input.ams.MachineLearning.MaxEpochs = 200


# ### Active learning settings
#
# Liquid water is a "simple" homogeneous system, so we can expect the ML method to perform quite well. We therefore decrease the success criteria thresholds a bit compared to the default vvalues, to ensure that we get accurate results.
#
# Since we will immediately continue with another active learning loop, we disable the "RerunSimulation" as we are not interested in the MD simulation per se.

al_s = plams.Settings()
al_s.input.ams.ActiveLearning.Steps.Type = "Geometric"
al_s.input.ams.ActiveLearning.Steps.Geometric.Start = 10
al_s.input.ams.ActiveLearning.Steps.Geometric.NumSteps = 8
al_s.input.ams.ActiveLearning.InitialReferenceData.Generate.M3GNetShortMD.Enabled = "Yes"
al_s.input.ams.ActiveLearning.SuccessCriteria.Energy.Relative = 0.003
al_s.input.ams.ActiveLearning.SuccessCriteria.Forces.MaxDeviationForZeroForce = 0.35
al_s.input.ams.ActiveLearning.AtEnd.RerunSimulation = "No"


# ### Complete job

settings = ref_s + nvt_md_s + ml_s + al_s
job = SimpleActiveLearningJob(settings=settings, molecule=starting_structure, name="sal")
print(job.get_input())


# ### Run the simple active learning job

job.run(watch=True)
# ## Validate trained model by RDF and MSD
#
# Note: You should skip this part if you trained to DFT since the reference MD calculation will take a very long time!

mol = job.results.get_main_molecule()
plams.plot_molecule(mol, rotation="-5x,5y,0z")


retrained_model_settings = job.results.get_params_job().results.get_production_engine_settings()
retrained_model_settings.runscript.nproc = 1


# ### Equilibration and production MD settings

eq_md_settings = plams.AMSNVTJob(
    nsteps=8000,
    timestep=0.5,
    thermostat="Berendsen",
    tau=100,
    temperature=300,
    samplingfreq=100,
).settings

prod_md_settings = plams.AMSNVTJob(
    nsteps=50000,
    timestep=0.5,
    thermostat="NHC",
    tau=200,
    temperature=300,
    samplingfreq=100,
).settings


# ### Retrained model equilibration

retrained_model_eq_md_job = plams.AMSJob(
    settings=eq_md_settings + retrained_model_settings,
    molecule=mol,
    name="retrained_model_eq_md_dens_1",
)
retrained_model_eq_md_job.run()
# ### Retrained model production simulation

# Let's then run a production simulation from the final structure of the above equilibration MD using both the retrained model and the reference engine:

retrained_model_prod_md_job = plams.AMSJob(
    settings=prod_md_settings + retrained_model_settings,
    name="retrained_model_prod_md_dens_1",
    molecule=retrained_model_eq_md_job.results.get_main_molecule(),
)
retrained_model_prod_md_job.run()
# ### Reference equilibration MD

reference_eq_md_job = plams.AMSJob(
    settings=eq_md_settings + ref_s,
    molecule=mol,
    name="reference_eq_md_dens_1",
)
reference_eq_md_job.run()
# ### Reference production MD

reference_prod_md_job = plams.AMSJob(
    settings=prod_md_settings + ref_s,
    name="reference_prod_md_dens_1",
    molecule=reference_eq_md_job.results.get_main_molecule(),
)
reference_prod_md_job.run()


# ### Mean squared displacement (MSD) helper functions

# For a detailed explanation of the MSD and RDF jobs, see the "Molecular dynamics with Python" tutorial


def get_msd_job(job: plams.AMSJob, symbol: str = "O"):
    atom_indices = [i for i, at in enumerate(job.results.get_main_molecule(), 1) if at.symbol == symbol]

    msd_job = plams.AMSMSDJob(
        job,
        name="msd-" + job.name,
        atom_indices=atom_indices,  # indices start with 1 for the first atom
        max_correlation_time_fs=4000,  # max correlation time must be set before running the job
        start_time_fit_fs=2000,  # start_time_fit can also be changed later in the postanalysis
    )
    msd_job.run()

    return msd_job


def plot_msd(job, start_time_fit_fs=None):
    """job: an AMSMSDJob"""
    time, msd = job.results.get_msd()
    fit_result, fit_x, fit_y = job.results.get_linear_fit(start_time_fit_fs=start_time_fit_fs)
    # the diffusion coefficient can also be calculated as fit_result.slope/6 (ang^2/fs)
    diffusion_coefficient = job.results.get_diffusion_coefficient(start_time_fit_fs=start_time_fit_fs)  # m^2/s
    plt.figure(figsize=(5, 3))
    plt.plot(time, msd, label="MSD")
    plt.plot(fit_x, fit_y, label="Linear fit slope={:.5f} ang^2/fs".format(fit_result.slope))
    plt.legend()
    plt.xlabel("Correlation time (fs)")
    plt.ylabel("Mean square displacement (ang^2)")
    plt.title("MSD: Diffusion coefficient = {:.2e} m^2/s".format(diffusion_coefficient))


plt.gcf().savefig("picture1.png")
plt.clf()


# ### Temporarily turn off PLAMS logging
#
# Technically, the MSD and RDF jobs are normal PLAMS jobs. However, they are very fast to run. We can turn off the PLAMS logging to keep the Jupyter notebook a bit more tidy:

plams.config.log.stdout = 0


# ### Retrained model MSD, diffusion coefficient

retrained_model_msd_job = get_msd_job(retrained_model_prod_md_job, "O")
retrained_model_D = retrained_model_msd_job.results.get_diffusion_coefficient()  # diffusion coefficient, m^2/s
plot_msd(retrained_model_msd_job)


# ### Reference MSD, diffusion coefficient

reference_msd_job = get_msd_job(reference_prod_md_job, "O")
reference_D = reference_msd_job.results.get_diffusion_coefficient()  # diffusion coefficient, m^2/s
plot_msd(reference_msd_job)


# **Conclusion for diffusion coefficient**: In this case, the retrained model gives 2.53e-9 m^2/s and the reference method 2.62e-9 m^2/s. That is very good agreement! Note: your results may be somewhat different.

# ### Retrained model and reference RDF
#
# Let's compare the calculated O-O, O-H, and H-H radial distribution functions (rdf):


def get_rdf(job, atom_indices, atom_indices_to, rmin, rmax, rstep):
    rdf = plams.AMSRDFJob(
        job,
        atom_indices=atom_indices,
        atom_indices_to=atom_indices_to,
        rmin=rmin,
        rmax=rmax,
        rstep=rstep,
    )
    rdf.run()

    return rdf.results.get_rdf()


final_frame = retrained_model_prod_md_job.results.get_main_molecule()  # doesn't matter if retrained model or reference
O_ind = [i for i, at in enumerate(final_frame, 1) if at.symbol == "O"]
H_ind = [i for i, at in enumerate(final_frame, 1) if at.symbol == "H"]
rmax = final_frame.lattice[0][0] / 2
rstep = 0.05


# ### O-O rdf

atom_indices, atom_indices_to = O_ind, O_ind
rmin = 2.0
pred_x, pred_y = get_rdf(retrained_model_prod_md_job, atom_indices, atom_indices_to, rmin, rmax, rstep)
ref_x, ref_y = get_rdf(reference_prod_md_job, atom_indices, atom_indices_to, rmin, rmax, rstep)
plt.plot(pred_x, pred_y, label="Retrained model")
plt.plot(ref_x, ref_y, label="Reference")
plt.xlabel("r (angstrom)")
plt.legend()
plt.title("O-O rdf")
plt.gcf().savefig("picture2.png")
plt.clf()


# ### O-H rdf

atom_indices, atom_indices_to = O_ind, H_ind
rmin = 1.3
pred_x, pred_y = get_rdf(retrained_model_prod_md_job, atom_indices, atom_indices_to, rmin, rmax, rstep)
ref_x, ref_y = get_rdf(reference_prod_md_job, atom_indices, atom_indices_to, rmin, rmax, rstep)
plt.plot(pred_x, pred_y, label="Retrained model")
plt.plot(ref_x, ref_y, label="Reference")
plt.xlabel("r (angstrom)")
plt.legend()
plt.title("O-H rdf")
plt.gcf().savefig("picture3.png")
plt.clf()


# ### H-H rdf

atom_indices, atom_indices_to = H_ind, H_ind
rmin = 1.3
pred_x, pred_y = get_rdf(retrained_model_prod_md_job, atom_indices, atom_indices_to, rmin, rmax, rstep)
ref_x, ref_y = get_rdf(reference_prod_md_job, atom_indices, atom_indices_to, rmin, rmax, rstep)
plt.plot(pred_x, pred_y, label="Retrained model")
plt.plot(ref_x, ref_y, label="Reference")
plt.xlabel("r (angstrom)")
plt.legend()
plt.title("H-H rdf")
plt.gcf().savefig("picture4.png")
plt.clf()


# ### Turn PLAMS logging back on

plams.config.log.stdout = 3  # default value


# ## Density and NPT
#
# ### Check the predicted vs. reference density

npt_md_s = plams.AMSNPTJob(
    nsteps=100000,
    timestep=0.5,
    thermostat="NHC",
    tau=100,
    temperature=300,
    barostat="MTK",
    barostat_tau=1000,
    equal="XYZ",
    pressure=1e5,
).settings


retrained_model_npt_job = plams.AMSJob(
    settings=npt_md_s + retrained_model_settings,
    name="retrained_model_npt",
    molecule=retrained_model_prod_md_job.results.get_main_molecule(),
)

retrained_model_npt_job.run()
retrained_model_density = (
    plams.AMSNPTJob.load_external(retrained_model_npt_job.results.rkfpath())
    .results.get_equilibrated_molecule()
    .get_density()
)
print(f"Retrained model water density at 300 K: {retrained_model_density * 1e-3:.2f} g/cm^3")


plams.config.jobmanager.hashing = None
reference_npt_job = plams.AMSJob(
    settings=npt_md_s + ref_s,
    name="reference_npt",
    molecule=reference_prod_md_job.results.get_main_molecule(),
)

reference_npt_job.run()


reference_density = (
    plams.AMSNPTJob.load_external(reference_npt_job.results.rkfpath()).results.get_equilibrated_molecule().get_density()
)
print(f"Reference model water density at 300 K: {reference_density * 1e-3:.2f} g/cm^3")


# The above reference value for ReaxFF Water-2017.ff agrees exactly with the published reference value of 1.01 g/cm^3.
#
# However, the retrained M3GNet model predicts a density of 0.95 g/cm^3. The agreement is reasonable but not excellent. This can be explained by the fact that almost all training data points were at 1.00 g/cm^3. Only a few points (from the "M3GNetShortMD" initial reference data generator) were taken at other densities.
#
# Let's continue the active learning while sampling more densities. There are two strategies:
#
# * Use an NPT simulation during the active learning
# * Scan the density during the active learning
#
# Here, we choose the second approach in order to ensure that multiple different densities are sampled.

# ### Initial structure for scanning density
#
# Get the final frame from one of the previous MD simulations, and linearly scale the density to 800 kg/m^3 = 0.8 g/cm^3. This will stretch out the O-H bonds so follow up with a short UFF preoptimization.

new_structure = final_frame.copy()
new_structure.set_density(850)
new_structure = plams.preoptimize(new_structure, model="uff", maxiterations=20)


plams.plot_molecule(new_structure)


# ### Second active learning job: scanning density
#
# Here we set Steps.Type = "Linear" to run reference calculations every 5000 MD steps.
#
# To get an accurate density it's very important that the predicted energy is accurate. It is not enough to just get a good fit for the forces.
#
# Here, we decrease the success criteria for both the energy and forces compared to default values.

nsteps = 80000
scan_density_md_s = plams.AMSMDScanDensityJob(
    molecule=new_structure,
    scan_density_upper=1.15,
    nsteps=nsteps,
    tau=100,
    thermostat="Berendsen",
    temperature=300,
).settings
# we must explicitly set the StopStep, since the AL divides the simulation into multiple segments
scan_density_md_s.input.ams.MolecularDynamics.Deformation.StopStep = nsteps


# job = SimpleActiveLearningJob.load_external(plams.config.default_jobmanager.workdir + "/sal.002")
scan_density_ml_s = ml_s.copy()
scan_density_ml_s.input.ams.MachineLearning.LoadModel = job.results.get_params_results_directory()
scan_density_ml_s.input.ams.MachineLearning.Target.Forces.MAE = 0.02
scan_density_ml_s.input.ams.MachineLearning.MaxEpochs = 200


scan_density_al_s = plams.Settings()

scan_density_al_s.input.ams.ActiveLearning.Steps.Type = "Linear"
scan_density_al_s.input.ams.ActiveLearning.Steps.Linear.Start = 500
scan_density_al_s.input.ams.ActiveLearning.Steps.Linear.StepSize = 5000
scan_density_al_s.input.ams.ActiveLearning.InitialReferenceData.Load.FromPreviousModel = "Yes"

scan_density_al_s.input.ams.ActiveLearning.SuccessCriteria.Energy.Relative = 0.001
scan_density_al_s.input.ams.ActiveLearning.SuccessCriteria.Energy.Total = 0.002
# because we do not set Normalization, the above Energy criteria are energies per atom
# scan_density_al_s.input.ams.ActiveLearning.SuccessCriteria.Energy.Normalization =

scan_density_al_s.input.ams.ActiveLearning.SuccessCriteria.Forces.MaxDeviationForZeroForce = 0.30

scan_density_al_s.input.ams.ActiveLearning.AtEnd.RerunSimulation = "No"
scan_density_al_s.input.ams.ActiveLearning.MaxReferenceCalculationsPerAttempt = 2


scan_density_al_job = SimpleActiveLearningJob(
    name="scan_density_al",
    settings=ref_s + scan_density_md_s + scan_density_ml_s + scan_density_al_s,
    molecule=new_structure,
)
scan_density_al_job.run(watch=True)
# Let's recalculate the density again:

new_retrained_model_settings = scan_density_al_job.results.get_params_job().results.get_production_engine_settings()

new_retrained_model_npt_job = plams.AMSJob(
    settings=npt_md_s + new_retrained_model_settings,
    name="new_retrained_model_npt",
    molecule=retrained_model_prod_md_job.results.get_main_molecule(),
)

new_retrained_model_npt_job.run()
new_retrained_model_density = (
    plams.AMSNPTJob.load_external(new_retrained_model_npt_job.results.rkfpath())
    .results.get_equilibrated_molecule()
    .get_density()
)
print(f"New retrained model water density at 300 K: {new_retrained_model_density * 1e-3:.2f} g/cm^3")


# **Conclusion for the density**: Using active learning when scanning the densities makes sure that the predicitons are accurate for all densities. Consequently the equilibrium density is also in better agreement with the reference value of 1.01 g/cm^3.
#
# Note that the density in general is quite difficult to fit accurately.