Comparison of systems

Simple methods for comparing two ChemicalSystem instances:

ChemicalSystem.has_same_atoms(other: ChemicalSystem, attributes: List[Literal['gui', 'adf', 'band', 'forcefield', 'dftb', 'reaxff', 'qe', 'vasp']] = []) -> bool
ChemicalSystem.has_same_atoms(other: ChemicalSystem, comp: atom_comparator_func_t) -> bool

Checks if two systems have identical atoms in the same order.

Atoms are compared one by one using either a user-defined comparator function or the Atom.has_identical_attributes method with a user-defined list of (optional) attribute groups to consider in the comparison.

ChemicalSystem.has_same_coords(other: ChemicalSystem, tol: float = 0.001, unit: str = 'angstrom') -> bool

Checks if the atomic coordinates of two systems are within a threshold of each other.

This check is intended for systems that have the same number of atoms, with all atoms in the same order. (You likely want to call has_same_atoms before calling this method.) The threshold tol is compared against the distance between each pair of corresponding atoms. This ensures that the return value of this method does not change when the same overall rotation is applied to both systems.
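
The role of the distance criterion can be illustrated with a minimal sketch in plain Python. This is not the library's implementation: `coords_within_tol`, `components_within_tol`, and `rotate_z` are hypothetical helpers, coordinates are plain (x, y, z) tuples, and the use of `<=` for the threshold is an assumption. The sketch contrasts a per-atom distance test with a per-component test, whose answer can flip when both systems are rotated together.

```python
import math

def coords_within_tol(coords_a, coords_b, tol=0.001):
    """True if every pair of corresponding points is at most tol apart."""
    if len(coords_a) != len(coords_b):
        return False
    return all(math.dist(p, q) <= tol for p, q in zip(coords_a, coords_b))

def components_within_tol(coords_a, coords_b, tol=0.001):
    """Contrast case: compare x, y, z separately (depends on orientation)."""
    return all(
        abs(u - v) <= tol
        for p, q in zip(coords_a, coords_b)
        for u, v in zip(p, q)
    )

def rotate_z(coords, angle):
    """Rotate all points about the z axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0009, 0.0009, 0.0)]  # second atom moved by about 0.00127

turn = math.pi / 4
print(coords_within_tol(a, b))                                          # False
print(coords_within_tol(rotate_z(a, turn), rotate_z(b, turn)))          # False
print(components_within_tol(a, b))                                      # True
print(components_within_tol(rotate_z(a, turn), rotate_z(b, turn)))      # False
```

The distance test gives the same answer in both orientations, while the component test does not.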

ChemicalSystem.has_same_geometry(other: ChemicalSystem, tol: float = 0.001, unit: str = 'angstrom') -> bool

Checks if the atomic coordinates and lattice vectors of two systems are within a threshold of each other. This is just a shorthand for calling has_same_coords and lattice.is_close() on the two systems.

ChemicalSystem.has_same_regions(other: ChemicalSystem) -> bool

Checks if two systems have identical regions, meaning region names match and each region includes the same atoms.

This only checks the indices of the atoms assigned to the different regions. It does not check if atoms with the same index are actually the same. Use has_same_atoms for that.

ChemicalSystem.has_same_selection(other: ChemicalSystem, consider_selection_order: bool = False) -> bool

Checks if two systems have the same atom selection.

This only checks the indices of the selected atoms. It does not check if atoms with the same index are actually the same. Use has_same_atoms for that.

By default the selection order is ignored in the check and the two selections are compared as a set. This can be changed via the consider_selection_order argument.
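
The comparison methods in this section can be combined into a single report. The following is a minimal sketch, not part of the library: `compare_systems` is a hypothetical helper that relies only on the method names and keyword arguments documented above, and it skips the coordinate check when the atoms do not match, since that check assumes identical atom ordering.

```python
def compare_systems(a, b, tol=0.001, unit="angstrom"):
    """Run the documented comparison methods and collect the results."""
    report = {"atoms": a.has_same_atoms(b)}
    # has_same_coords is only meaningful if the atoms match one by one,
    # so it is not called at all when the atom check fails
    report["coords"] = report["atoms"] and a.has_same_coords(b, tol=tol, unit=unit)
    # Region and selection checks compare atom indices only
    report["regions"] = a.has_same_regions(b)
    report["selection"] = a.has_same_selection(b, consider_selection_order=False)
    return report
```

With two ChemicalSystem instances `a` and `b`, `compare_systems(a, b)` would return a dictionary of booleans with the keys "atoms", "coords", "regions", and "selection".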

Atomic and Molecular hashing (fingerprints)

The ChemicalSystem class provides molecular hashing (fingerprinting) capabilities that generate unique integer identifiers for molecules and individual atoms based on their chemical structure and environment. These hashes enable molecule comparison, detection of chemically equivalent atoms, and comparison of atomic chemical environments.

The hashes are invariant to atom ordering and molecular rotation/translation, but capture the essential chemical topology.

Note

Bonding information is required for hashing to work. Call guess_bonds() first if bonds aren’t defined!

Why use molecular hashing?

  • Molecule comparison / identification: Efficiently determine if two molecules are identical or different. The same molecule produces the same hash regardless of atom ordering or atomic coordinates (unless stereochemistry options are enabled, see below).

  • Chemical equivalence detection: Identify atoms in symmetric or equivalent chemical environments.
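
As an illustration of the first use case, duplicates can be removed from a collection of systems by remembering which hashes have already been seen. This is a minimal sketch: `unique_by_hash` is a hypothetical helper, and it assumes only that each object offers the `hash()` method with the keyword options described in the API reference below.

```python
def unique_by_hash(systems, **hash_options):
    """Keep the first system for every distinct molecular hash."""
    seen = set()
    unique = []
    for system in systems:
        # The same options are used for every system so the hashes are comparable
        key = system.hash(**hash_options)
        if key not in seen:
            seen.add(key)
            unique.append(system)
    return unique
```

For example, `unique_by_hash(molecules, rs_stereo=True)` would treat enantiomers as distinct, while `unique_by_hash(molecules)` would keep only one of them.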

Basic Examples

Example 1: Hash invariance to atom ordering and geometry

The molecular hash is the same regardless of how atoms are ordered or how the molecule is positioned in space. This can be useful if you want to check if two molecules with different geometry correspond to the “same system”, even if you cannot assume that the atoms are in the same order.

from scm.base import ChemicalSystem
import random

# Create a simple molecule: ethanol
mol_a = ChemicalSystem.from_smiles("OCC")
mol_a.guess_bonds()
print(f"{'Hash of original ethanol':<40}: {mol_a.hash()}")

# Let's make a copy and modify it a bit
mol_b = mol_a.copy()
shuffle_indices = list(range(mol_b.num_atoms))
random.shuffle(shuffle_indices)
mol_b.reorder_atoms(shuffle_indices)
mol_b.perturb_coordinates(0.1)  # randomly perturb coordinates
print(f"{'Hash of shuffled and perturbed ethanol':<40}: {mol_b.hash()}")

# But remove an atom or break a bond, and the hash changes:
mol_c = mol_a.copy()
mol_c.bonds.remove_bond(0)
print(f"{'Hash of ethanol with a bond removed':<40}: {mol_c.hash()}")

Example 2: Using atomic hashes to identify chemical environments

Atomic hashes reveal which atoms are in equivalent chemical environments. The depth parameter controls how much of the surrounding structure is considered. The various optional parameters determine which features will be considered.

from scm.base import ChemicalSystem

# Molecule with two methyl groups and various hydrogens
mol = ChemicalSystem(
    """
    System
        Atoms
            C -4.0419  0.5780 -0.4490  # 0   Methyl group 1
            H -3.4703  0.0193 -1.2206  # 1
            H -5.0855  0.7047 -0.8071  # 2
            H -4.0585 -0.0070  0.4950  # 3
            C -1.1373  3.1240  0.2318  # 4   Methyl group 2
            H -0.1113  2.9084  0.5999  # 5
            H -1.6136  3.8633  0.9089  # 6
            H -1.0664  3.5552 -0.7887  # 7
            C -3.4113  1.9507 -0.2140  # 8   Middle C
            H -3.9782  2.4815  0.5805  # 9
            H -3.4937  2.5415 -1.1513  # 10
            C -1.9597  1.8726  0.1904  # 11  C connected to O
            O -1.4448  0.8029  0.4844  # 12
        End
    End
    """
)
mol.guess_bonds()

In the following picture we show the indices of the atoms.

[Figure: the molecule defined above with its atom indices labeled (../_images/chemsys_hash_mol_608480e1.png)]

Now let’s examine how the atomic hashes change with different depths:

# ==== Full molecular context (depth=-1, the default) ====
hashes_full = mol.atomic_hashes(depth=-1)

# Hydrogens within the same methyl group are equivalent
assert hashes_full[1] == hashes_full[2] == hashes_full[3]  # Methyl 1 H's
assert hashes_full[5] == hashes_full[6] == hashes_full[7]  # Methyl 2 H's

# But hydrogens from different methyl groups are NOT equivalent (different molecular position)
assert hashes_full[1] != hashes_full[5]

# The two methyl carbons are also different
assert hashes_full[0] != hashes_full[4]


# ==== Now with depth=0 (only local properties) ====
hashes_local = mol.atomic_hashes(depth=0)

# Both methyl carbons and the middle carbon now look the same (all are C with 4 bonds)
assert hashes_local[0] == hashes_local[4] == hashes_local[8]

# All hydrogens are now equivalent (each is H with a single bond)
for h_idx in [1, 2, 3, 5, 6, 7, 9, 10]:
    assert hashes_local[1] == hashes_local[h_idx]

# But the carbon connected to O is still different (it has 3 bonds instead of 4)
assert hashes_local[0] != hashes_local[11]


# ==== With depth=2, we see intermediate behavior ====
hashes_d2 = mol.atomic_hashes(depth=2)

# Methyl 1 hydrogens are still equivalent
assert hashes_d2[1] == hashes_d2[2] == hashes_d2[3]

# But now different from methyl 2 (within 2 bonds they "see" different environments)
assert hashes_d2[1] != hashes_d2[5]
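
Instead of asserting individual equalities, the list returned by atomic_hashes() can be turned into groups of equivalent atoms. The sketch below is a hypothetical helper in plain Python (`equivalence_classes` is not part of the library); it works on any list of integers, so the demonstration uses made-up hash values.

```python
from collections import defaultdict

def equivalence_classes(hashes):
    """Group atom indices that share the same atomic hash."""
    classes = defaultdict(list)
    for index, value in enumerate(hashes):
        classes[value].append(index)
    # Order the groups by their first atom index for readable output
    return sorted(classes.values(), key=lambda members: members[0])

# Made-up hash values: atoms 1, 2 and 3 are equivalent
print(equivalence_classes([7, 3, 3, 3, 9]))  # [[0], [1, 2, 3], [4]]
```

Applied to the molecule above, `equivalence_classes(mol.atomic_hashes(depth=-1))` would list the three hydrogens of each methyl group as separate groups.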

Example 3: Effect of hashing options

Different options capture different aspects of molecular structure. Most of the options relate to connectivity (bonds), but rs_stereo and ez_stereo look at the 3D information to distinguish between isomers or enantiomers.

from scm.base import ChemicalSystem

# Lactic acid enantiomers
d_lactic = ChemicalSystem.from_smiles("C[C@H](C(=O)O)O")
l_lactic = ChemicalSystem.from_smiles("C[C@@H](C(=O)O)O")

print("With rs_stereo=False, the enantiomers are identical:")
print(f"   d_lactic hash : {d_lactic.hash(rs_stereo=False)}")
print(f"   l_lactic hash : {l_lactic.hash(rs_stereo=False)}")

print("With rs_stereo=True, the enantiomers are different:")
print(f"   d_lactic hash : {d_lactic.hash(rs_stereo=True)}")
print(f"   l_lactic hash : {l_lactic.hash(rs_stereo=True)}")

# 2-Butene cis/trans isomers:
cis = ChemicalSystem.from_smiles(r"C/C=C\C")  # raw string: the backslash is part of the SMILES
trans = ChemicalSystem.from_smiles("C/C=C/C")

print("With ez_stereo=False, the isomers are identical:")
print(f"   cis hash   : {cis.hash(ez_stereo=False)}")
print(f"   trans hash : {trans.hash(ez_stereo=False)}")

print("With ez_stereo=True, the isomers are different:")
print(f"   cis hash   : {cis.hash(ez_stereo=True)}")
print(f"   trans hash : {trans.hash(ez_stereo=True)}")

How it works

The hashing algorithm operates in three stages:

  1. Local atomic hashing: Each atom receives an initial hash based on its element type and local chemical properties (coordination number, bond orders, ring membership, etc.)

  2. Stereochemistry incorporation (optional): The hashes are updated to include stereochemical information for double bonds (E/Z isomerism) and chiral centers (R/S configuration)

  3. Environment propagation (optional, enabled by default): Each atom’s hash is combined with hashes from neighboring atoms at increasing distances, allowing the hash to capture increasingly larger regions of the molecular graph

The molecular hash is computed by summing all atomic hashes, producing a single integer that characterizes the entire structure.
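
The stages above can be sketched with a toy implementation in plain Python. This is only an illustration under stated assumptions, not the algorithm used by ChemicalSystem: stage 2 (stereochemistry) is omitted, the local hash uses only the element and coordination number, "unlimited" depth is approximated by as many rounds as there are atoms, and all function names (`_digest`, `toy_atomic_hashes`, `toy_molecular_hash`) are hypothetical.

```python
import hashlib

def _digest(*parts):
    """Deterministic 64-bit integer digest of the given values."""
    raw = hashlib.blake2b(repr(parts).encode(), digest_size=8).digest()
    return int.from_bytes(raw, "big")

def toy_atomic_hashes(elements, bonds, depth=-1):
    """Toy atomic hashes from element symbols and (i, j) bond pairs."""
    n = len(elements)
    neighbors = [[] for _ in range(n)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # Stage 1: local hash from element and coordination number
    hashes = [_digest(elements[i], len(neighbors[i])) for i in range(n)]
    # Stage 3: each round mixes in the sorted hashes of the bonded neighbors,
    # so information travels one bond further per round
    rounds = n if depth < 0 else depth
    for _ in range(rounds):
        hashes = [
            _digest(hashes[i], sorted(hashes[j] for j in neighbors[i]))
            for i in range(n)
        ]
    return hashes

def toy_molecular_hash(elements, bonds, depth=-1):
    """Sum of the atomic hashes, which does not depend on atom ordering."""
    return sum(toy_atomic_hashes(elements, bonds, depth))

# Heavy-atom skeleton C-C-C(O)-C written in two different atom orderings
elements_a = ["C", "C", "C", "C", "O"]
bonds_a = [(0, 1), (1, 2), (2, 3), (2, 4)]
elements_b = ["O", "C", "C", "C", "C"]
bonds_b = [(0, 1), (1, 2), (1, 3), (3, 4)]

print(toy_molecular_hash(elements_a, bonds_a) == toy_molecular_hash(elements_b, bonds_b))
```

Sorting the neighbor hashes before mixing them in and summing the atomic hashes at the end are what make the toy result independent of atom ordering.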

Important notes:

  • Hash values are deterministic but arbitrary — the numerical value itself has no chemical meaning, and small structural changes produce completely different hash values

  • Hashes are not smooth features suitable for machine learning; they are designed for exact structure matching

  • The algorithm requires bonding information — use guess_bonds() before hashing if bonds aren’t already defined

  • Hash collisions are theoretically possible but extremely rare with appropriate options

  • Conformers all have the same hash value. Enantiomers have different values if rs_stereo is enabled, and E/Z isomers have different values if ez_stereo is enabled

Recommended settings:

  • Basic molecular comparison: Use defaults

  • Stereochemistry-aware comparison: Enable ez_stereo=True and rs_stereo=True

  • Detailed chemical environment: Enable bond_orders=True and ring=True

  • Comparing the local chemical environment of atoms: Use a finite depth, typically 1 or 2 (the best value depends on your application)
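
The recommended settings can be kept in one place so that every comparison in a project uses the same options. This is a minimal sketch: the preset names and the helpers `hash_with_preset` and `local_environment_hashes` are hypothetical, while the keyword arguments are the ones documented for hash() and atomic_hashes().

```python
# Keyword arguments for each recommendation; an empty dict means "use defaults"
HASH_PRESETS = {
    "basic": {},
    "stereo": {"ez_stereo": True, "rs_stereo": True},
    "detailed": {"bond_orders": True, "ring": True},
}

def hash_with_preset(system, preset="basic"):
    """Molecular hash computed with one of the named option sets."""
    return system.hash(**HASH_PRESETS[preset])

def local_environment_hashes(system, depth=2):
    """Atomic hashes limited to a finite depth for local comparisons."""
    return system.atomic_hashes(depth=depth)
```

An unknown preset name raises a KeyError, which helps catch typos before two incomparable hashes are computed.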

Note

This hashing functionality replaces the molsg library from earlier versions of the Amsterdam Modeling Suite. The new implementation is integrated directly into the ChemicalSystem class, eliminating the need for separate molsg calculations.

Although the hashing algorithm follows the same general principles, the hash values produced by ChemicalSystem differ from those produced by the molsg library.

API Reference

ChemicalSystem.hash(depth: int = -1, coordination: bool = True, bond_orders: bool = False, ez_stereo: bool = False, rs_stereo: bool = False, ring: bool = False, implicit_h: bool = False) -> int

Compute a hash-based identifier for the entire molecule.

Generates a single integer hash value that uniquely identifies the molecular structure. This hash is computed by combining the atomic hashes (see atomic_hashes()) of all atoms in the molecule. Two molecules with identical structures/connectivity will produce the same hash value regardless of atom ordering or coordinates (rotation/translation).

The hashing algorithm is based on molecular connectivity (bonding topology). By default, different conformations of the same molecule (same bonds, different 3D coordinates) produce the same hash. However, when stereochemistry options (ez_stereo, rs_stereo) are enabled, stereoisomers (cis/trans, R/S enantiomers) are distinguished and produce different hash values.

Parameters:
  • depth: Hash propagation depth through the molecular graph (controls how much structural information is incorporated):

    -1 (default): Unlimited propagation - the hash reflects the complete molecular structure. Recommended for whole-molecule comparison.

    0: No propagation - hash based only on local atomic properties without considering connectivity patterns beyond immediate neighbors.

    n > 0: Propagate information up to n bonds away. Smaller values produce hashes that are less sensitive to distant structural features.

  • coordination: Include coordination number (number of bonds per atom) in the hash. Enabled by default. Essential for distinguishing most structural differences.

  • bond_orders: Include bond order information (single/double/triple bonds). When disabled (default), molecules differing only in bond orders may produce the same hash. Enable to distinguish resonance structures or different bond order assignments.

  • ez_stereo: Include E/Z stereochemistry around double bonds. When enabled, cis and trans isomers produce different hashes. Requires meaningful 3D coordinates to determine stereochemical configuration.

  • rs_stereo: Include R/S stereochemistry at chiral centers. When enabled, enantiomers produce different hashes. Requires meaningful 3D coordinates to determine absolute configuration.

  • ring: Include ring membership information. When enabled, distinguishes atoms in rings from acyclic atoms and incorporates ring sizes.

  • implicit_h: Treat hydrogen atoms implicitly. When enabled, hydrogen atoms contribute zero to the hash, and non-hydrogen atoms include a contribution based on bonded hydrogens. Useful for comparing structures with different hydrogen representations (explicit vs. implicit).

Returns:

Integer hash value identifying the molecular structure. The hash is deterministic and reproducible but the numerical value itself is arbitrary.

Notes:
  • Requires bonding information: Call guess_bonds() first if bonds aren’t defined

  • Atom-ordering invariant: Reordering atoms in the molecule produces the same hash

  • Coordinate invariant (by default): Rotating, translating, or reflecting the molecule produces the same hash (unless ez_stereo or rs_stereo are enabled)

  • Parameter consistency critical: When comparing molecules, use identical parameter settings for both. Different parameters will produce incomparable hash values.

  • Collisions extremely rare: With default settings, hash collisions between different molecules are theoretically possible but exceptionally unlikely for typical organic molecules

Warnings:
  • Comparing hashes computed with different parameter settings is meaningless

  • Stereochemistry options require valid 3D coordinates; incorrect coordinates may produce unreliable stereochemical assignments
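
One way to guard against comparing hashes computed with different settings is to store the options together with the hash value. The following is a minimal sketch: `hash_key` and `DEFAULT_OPTIONS` are hypothetical names, and the default values are copied from the signature shown above.

```python
# Defaults as listed in the hash() signature
DEFAULT_OPTIONS = {
    "depth": -1,
    "coordination": True,
    "bond_orders": False,
    "ez_stereo": False,
    "rs_stereo": False,
    "ring": False,
    "implicit_h": False,
}

def hash_key(system, **options):
    """Return (options, hash) so keys made with different options never match."""
    unknown = set(options) - set(DEFAULT_OPTIONS)
    if unknown:
        raise TypeError(f"unknown hash options: {sorted(unknown)}")
    # Fill in defaults so that ring=False and "ring not given" yield the same key
    merged = {**DEFAULT_OPTIONS, **options}
    return (tuple(sorted(merged.items())), system.hash(**merged))
```

Two keys compare equal only if both the option sets and the hash values agree, so a hash computed with rs_stereo=True can never be mistaken for one computed with the defaults.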

See Also:
  • atomic_hashes(): Compute individual hashes for each atom

ChemicalSystem.atomic_hashes(depth: int = -1, coordination: bool = True, bond_orders: bool = False, ez_stereo: bool = False, rs_stereo: bool = False, ring: bool = False, implicit_h: bool = False) -> List[int]

Compute hash-based atomic structure identifiers for all atoms.

Generates unique integer hashes for each atom based on their local chemical environment. These hashes enable identification of chemically equivalent atoms, substructure matching, and atom-level structure comparison. The hashes are designed for exact structure identification and isomorphism detection - NOT for machine learning features (the hash values are not “smooth” and small structural changes produce completely different values).

The hashing algorithm combines atomic properties (element type, coordination, bond orders, etc.) with information from the surrounding molecular graph. By default, each atom’s hash reflects its complete molecular environment.

Parameters:
  • depth: Hash propagation depth through the molecular graph (controls how much of the molecular environment is incorporated into each atom’s hash):

    -1 (default): Unlimited propagation - each atom’s hash incorporates information from the entire molecule, making hashes sensitive to the complete molecular structure.

    0: No propagation - hashes reflect only immediate atomic properties (element, coordination number, bond orders, etc.) without considering neighbors.

    n > 0: Propagate information up to n bonds away - hashes incorporate environment information from atoms within n bonds, enabling local similarity comparisons while ignoring distant parts of the molecule.

  • coordination: Include coordination number (number of bonds) in the hash. Enabled by default. Distinguishes atoms with different bonding patterns.

  • bond_orders: Include bond order information (single/double/triple bonds). When enabled, the hash distinguishes between C-C, C=C, and C≡C bonding. Bond orders are mapped to standard values (1.0, 1.5, 2.0, 3.0, etc.).

  • ez_stereo: Include E/Z stereochemistry for double bonds. When enabled, cis and trans isomers produce different hashes. Requires 3D coordinates to determine stereochemical configuration.

  • rs_stereo: Include R/S stereochemistry for chiral centers. When enabled, enantiomers produce different hashes. Requires 3D coordinates to determine handedness at tetrahedral stereocenters.

  • ring: Include ring membership information. When enabled, atoms in rings are distinguished from non-ring atoms, and the smallest ring size is incorporated into the hash.

  • implicit_h: Treat hydrogen atoms implicitly. When enabled:

    • All hydrogen atoms receive a hash value of zero

    • Non-hydrogen atoms get a contribution based on the number of bonded hydrogens

    • Useful for comparing structures with different hydrogen representations

Returns:

List of integers (one per atom). Atoms in identical chemical environments have identical hash values. Hash values are deterministic and reproducible but arbitrary as numbers (a small structural difference typically produces a completely different hash value).

Notes:
  • Requires bonding information: Call guess_bonds() first if bonds aren’t defined

  • Hash ordering invariance: Reordering atoms produces the same hash values (though in different positions in the returned list)

  • Parameter consistency: When comparing atoms across molecules, use identical parameter settings

  • Stereochemistry requires 3D: The ez_stereo and rs_stereo options require meaningful 3D coordinates

  • Not for ML: These are discrete identifiers, not continuous features suitable for machine learning
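
When comparing atoms across molecules, a finite depth makes it possible to count the local environments that two molecules have in common. This is a minimal sketch under the assumptions above: `shared_environments` is a hypothetical helper that relies only on atomic_hashes() and passes identical options to both molecules.

```python
from collections import Counter

def shared_environments(mol_a, mol_b, depth=1, **options):
    """Count atomic environments (by hash) that occur in both molecules."""
    count_a = Counter(mol_a.atomic_hashes(depth=depth, **options))
    count_b = Counter(mol_b.atomic_hashes(depth=depth, **options))
    # Multiset intersection: for each hash, the smaller of the two counts
    return count_a & count_b
```

The result maps each shared hash to the number of atoms that can be paired up between the two molecules; an empty result means that no atom of one molecule has a counterpart with the same environment in the other at the chosen depth.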

See Also:
  • hash(): Compute a single hash for the entire molecule