Database

This submodule contains several classes that provide an interface to an SQL database for managing COSKF files and physical properties.

class pyCRS.Database.COSKFDatabase(path: str)

A class that provides an interface to an SQL database containing the following tables.

Table name          Description

Compound            Unique compounds with COSKF files by CAS number or identifier.
Conformer           Multiple conformers with corresponding COSKF files.
PhysicalProperty    User-defined physical properties.
PropPred            Properties estimated from SMILES using QSPR methods.

Parameters:

path (str) – Path to the database file. Created if it doesn’t exist.

Example:

from pyCRS.Database import COSKFDatabase

db = COSKFDatabase("my_coskf_db.db")
db.add_compound("Water.coskf")
db.add_compound("Benzene.coskf", cas="71-43-2")
db.add_physical_property("Benzene", "meltingpoint", 278.7)
db.add_physical_property("Benzene", "hfusion", 9.91, unit="kJ/mol")
db.estimate_physical_property("Benzene")
add_compound(coskf_file: str, name: str | None = None, cas: str | None = None, identifier: str | None = None, coskf_path: str | None = None, smiles: str | None = None, nring: int | None = None, ignore_smiles_check: bool = False, ignore_duplicates: bool = False)

Adds a new .coskf file to the database.

Parameters:

coskf_file (str) – a path to the .coskf file, or alternatively, the file name of the .coskf file if the coskf_path is provided.

Keyword Arguments:
  • name (str, optional) – Compound name. Defaults to the IUPAC name, CAS number, identifier, or .coskf file name if not specified. Can be set via keyword argument or read from the .coskf file.

  • cas (str, optional) – CAS number. If not provided, the value from the .coskf file is used if available.

  • identifier (str, optional) – Chemical identifier of the compound.

  • coskf_path (str, optional) – Directory containing the .coskf file. Defaults to the ADFCRS-2018 database path.

  • smiles (str, optional) – SMILES string. Defaults to the value in the .coskf file if available.

  • nring (int, optional) – Number of ring atoms. Defaults to the value from the .coskf file.

  • ignore_smiles_check (bool, optional) – If True, skips the identity check via SMILES generation. Defaults to False.

  • ignore_duplicates (bool, optional) – If True, skips duplicate recognition using UniqueConformersCrest in the AMSConformer tool. Defaults to False.

Note

  • Each compound must have a unique CAS number or identifier.

  • During add_compound, CAS and identifier are checked for uniqueness in the database.

  • An error is raised if multiple compounds share the same CAS number or identifier.

  • The example below is invalid because both compounds use the same identifier, CRS0001.

db.add_compound("Benzene.coskf", cas="71-43-2", identifier="CRS0001")
db.add_compound("Ethanol.coskf", cas="64-17-5", identifier="CRS0001")
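
The uniqueness rule in the note above can be illustrated with a small standalone sketch. It uses Python's built-in sqlite3 module and a hypothetical table layout; the actual pyCRS schema and error handling are not documented here and may differ. The helper name add_compound below is only a stand-in and is not the COSKFDatabase method.

```python
import sqlite3

# Hypothetical schema for illustration only; not the actual pyCRS table layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compound ("
    " compound_id INTEGER PRIMARY KEY,"
    " name TEXT,"
    " cas TEXT UNIQUE,"
    " identifier TEXT UNIQUE)"
)


def add_compound(conn, name, cas=None, identifier=None):
    """Insert a compound; return True on success, False on a uniqueness clash."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO compound (name, cas, identifier) VALUES (?, ?, ?)",
                (name, cas, identifier),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised when the CAS number or the identifier is already present.
        return False


first = add_compound(conn, "Benzene", cas="71-43-2", identifier="CRS0001")
second = add_compound(conn, "Ethanol", cas="64-17-5", identifier="CRS0001")
count = conn.execute("SELECT COUNT(*) FROM compound").fetchone()[0]
print(first, second, count)
```

In this sketch the second insert is rejected because the identifier CRS0001 is already taken, even though the CAS numbers differ.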
add_physical_property(identifier: str, attribute: str, value: float | str, unit: str | None = None)

Adds a physical property value to the PhysicalProperty TABLE in the database, using the compound’s identifier.

Parameters:
  • identifier (str) – CAS number, identifier or compound name.

  • attribute (str) – Name of the physical property (e.g. meltingpoint or hfusion).

  • value (float or str) – Value of the physical property.

  • unit (str, optional) – Unit of the input value. The default units are K, kcal/mol and kcal/mol-K. The following units are accepted and are automatically converted to the default units:

      - Temperature: K, C
      - Enthalpy: kcal/mol, kJ/mol, cal/g, J/g
      - Heat capacity: kcal/mol-K, kJ/mol-K, cal/g-K, J/g-K
      - pvap: bar, atm, Pa, mmHg
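
The conversions to the default units can be sketched in plain Python as follows. This is an illustrative stand-in, not the library's implementation: the function names, the molar_mass argument needed for the per-gram units, and the choice of the thermochemical calorie (1 cal = 4.184 J) are all assumptions.

```python
# Illustrative conversion helpers; names and factors are assumptions,
# not the pyCRS implementation.
J_PER_CAL = 4.184  # thermochemical calorie
BAR_PER_UNIT = {"bar": 1.0, "atm": 1.01325, "Pa": 1.0e-5, "mmHg": 1.33322e-3}


def temperature_to_kelvin(value, unit="K"):
    """Convert a temperature given in K or C to Kelvin."""
    if unit == "K":
        return value
    if unit == "C":
        return value + 273.15
    raise ValueError(f"unsupported temperature unit: {unit}")


def enthalpy_to_kcal_per_mol(value, unit="kcal/mol", molar_mass=None):
    """Convert an enthalpy to kcal/mol; molar_mass (g/mol) is needed for per-gram units."""
    if unit == "kcal/mol":
        return value
    if unit == "kJ/mol":
        return value / J_PER_CAL
    if unit in ("cal/g", "J/g"):
        if molar_mass is None:
            raise ValueError("molar_mass is required for per-gram units")
        per_mol = value * molar_mass  # cal/mol or J/mol
        if unit == "J/g":
            per_mol /= J_PER_CAL  # J/mol -> cal/mol
        return per_mol / 1000.0  # cal/mol -> kcal/mol
    raise ValueError(f"unsupported enthalpy unit: {unit}")


def pressure_to_bar(value, unit="bar"):
    """Convert a pressure given in bar, atm, Pa or mmHg to bar."""
    return value * BAR_PER_UNIT[unit]


print(temperature_to_kelvin(-11.63, "C"))
print(enthalpy_to_kcal_per_mol(9.91, "kJ/mol"))
print(pressure_to_bar(760.0, "mmHg"))
```

The heat capacity units follow the same factors as the enthalpy units, with an extra per-Kelvin that is unaffected by the conversion.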

Note

The vp_equation accepts only parameters for pressure in bar and temperature in Kelvin.

db.add_physical_property("Benzene", "meltingpoint", 278.7)
db.add_physical_property("Benzene", "hfusion", 9.91, unit="kJ/mol")
db.add_physical_property("Benzene", "vp_equation", "Antoine")
db.add_physical_property("Benzene", "vp_params", "4.72583, 1660.652, -1.461")
db.add_physical_property("Benzene", "flashpoint", -11.63, unit="C")
# Vapor pressure at 353.25 K is 1.01325 bar
db.add_physical_property("Benzene", "tvap", 353.25)
db.add_physical_property("Benzene", "pvap", 1.01325)
clear_physical_property(identifier: str | List[str] | None = None, attribute: str | List[str] | None = None)

Clears physical property values in the PhysicalProperty TABLE in the database, selected by the compound’s identifier.

Parameters:
  • identifier (str or List[str], optional) – CAS number, chemical identifier or compound name as a string or a list of strings. If None, all compounds are selected.

  • attribute (str or List[str], optional) – Specific property to clear as a string or a list of strings. If None, all properties are cleared.

db.clear_physical_property(["water", "benzene"])
del_row(dbrow: CompoundRow | Dict[str, List[CompoundRow]])

Remove a compound from the database and delete the corresponding .coskf file.

Parameters:

dbrow (CompoundRow or Dict[str, List[CompoundRow]]) – the row to remove from the database

del_row_by_conformer_id(conformer_id: int)

Remove the conformer from the database.

Parameters:

conformer_id (int) – An integer identifying the conformer in the CONFORMER TABLE.

db.del_row_by_conformer_id(1)
del_rows(dbrows: List[CompoundRow] | Dict[str, List[CompoundRow]])

Remove multiple compounds from the database and delete the corresponding .coskf files.

Parameters:

dbrows (List[CompoundRow] or Dict[str, List[CompoundRow]]) – the rows to remove from the database.

db.del_rows(db.get_compounds('benzene'))
estimate_physical_property(identifier: str | List[str] | None = None, compound_id: int | List[int] | None = None)

Estimates physical properties using the property prediction tool and adds the values to the PropPred TABLE in the database.

Keyword Arguments:
  • identifier (str or List[str], optional) – CAS number, chemical identifier or compound name as a string or a list of strings.

  • compound_id (int or List[int], optional) – an integer or a list representing the compound ID(s).

Note

The QSPR descriptor used in the property prediction tool is determined from the SMILES string. The SMILES string is selected in the following order of priority: (1) the user-provided SMILES passed to the add_compound() method; (2) the SMILES read from the .coskf file; (3) the SMILES generated by OpenBabel from the compound’s coordinates in the .coskf file. Note that the automatically resolved SMILES may be incorrect for some molecules, for instance when bond orders cannot be determined automatically or for charged species.
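
The SMILES selection priority described in this note amounts to taking the first available value from an ordered list of candidates. A minimal sketch, assuming a hypothetical helper select_smiles that is not part of pyCRS:

```python
def select_smiles(user_smiles=None, coskf_smiles=None, resolved_smiles=None):
    """Return the first non-empty SMILES in priority order, or None if none is set.

    Hypothetical helper: the argument names mirror the three sources in the note
    (user-provided, read from the .coskf file, resolved by OpenBabel).
    """
    for candidate in (user_smiles, coskf_smiles, resolved_smiles):
        if candidate:  # skips None and empty strings
            return candidate
    return None


# The user-provided value wins when present.
print(select_smiles("c1ccccc1", "C1=CC=CC=C1", "c1ccccc1"))
# Falls back to the resolved SMILES when nothing else is available.
print(select_smiles(None, "", "CCO"))
```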

db.estimate_physical_property("Benzene")
get_all_compounds() List[CompoundRow]

Retrieves all compounds in the database.

Returns:

The full list of CompoundRow instances in the database

Return type:

List[CompoundRow]

get_all_conformers() List[ConformerRow]

Retrieves all conformers in the database.

Returns:

The full list of ConformerRow instances in the database.

Return type:

List[ConformerRow]

get_all_physical_properties(source: str = 'PhysicalProperty') List[PhysicalPropertyRow] | List[PropPredRow]

Retrieves all physical properties in the database.

Parameters:

source (str, optional) – Source of the properties.

  • ‘PhysicalProperty’ (default): Returns properties from the PhysicalProperty table.

  • ‘PropPred’: Returns estimated properties from the PropPred table.

Returns:

A list of PhysicalPropertyRow instances or PropPredRow instances in the database.

Return type:

List[PhysicalPropertyRow] or List[PropPredRow]

get_attribute_by_compound_id(attributes: str | List[str], compound_id: int | List[int], source: str | List[str] | None = None)

Retrieve the list of values for compounds with specified compound_id(s) in the database

Parameters:
  • attributes (str or List[str]) – Attribute(s) to be retrieved.

  • compound_id (int or List[int]) – An integer or a list of integers used to search for compounds in the COMPOUND TABLE.

  • source (str or List[str], optional) – The table used in the search. Default is COMPOUND TABLE and PhysicalProperty TABLE

Returns:

A list of tuples containing the values of the specified attributes for the compounds.

Return type:

list of attributes

db.get_attribute_by_compound_id("name", 1)
db.get_attribute_by_compound_id(["name", "cas", "hfusion"], 1)
db.get_attribute_by_compound_id(["name", "hfusion"], 1, source=["COMPOUND","PropPred"])
get_compounds(identifier: str | List[str]) Dict[str, CompoundRow]

Retrieves compounds from the COMPOUND TABLE in the database by matching CAS number, chemical identifier, or name.

Parameters:

identifier (str or List[str]) – CAS number, chemical identifier or compound name as a string or a list of strings.

Returns:

A dictionary in which each key is an input identifier and the corresponding value is the matching CompoundRow instance.

Return type:

Dict[str, CompoundRow]

get_compounds_id(identifier: str | List[str]) List[int | None]

Retrieves compound IDs from the COMPOUND TABLE in the database by matching CAS number, chemical identifier, or name.

Parameters:

identifier (str or List[str]) – CAS number, chemical identifier or compound name as a string or a list of strings.

Returns:

A list of compound IDs corresponding to the input identifier. If a name is not found, None is returned at the corresponding position.

Return type:

List[Optional[int]]
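
Because a None is returned at the position of each identifier that is not found, the result can be paired back with the input list. The sketch below works on a made-up return value rather than a real database, so the IDs shown are purely illustrative.

```python
# Hypothetical input and return value of get_compounds_id; the IDs are made up.
identifiers = ["Benzene", "NotInDatabase", "64-17-5"]
compound_ids = [1, None, 2]

# Map each identifier that was found to its compound ID.
found = {
    name: cid for name, cid in zip(identifiers, compound_ids) if cid is not None
}

# Collect the identifiers that produced no match.
missing = [name for name, cid in zip(identifiers, compound_ids) if cid is None]

print(found)
print(missing)
```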

get_conformers(identifier: str | List[str]) Dict[str, ConformerRow]

Retrieves conformers from the CONFORMER TABLE in the database by matching CAS number, chemical identifier, or name.

Parameters:

identifier (str or List[str]) – CAS number, chemical identifier or compound name as a string or a list of strings.

Returns:

A dictionary in which each key is an input identifier and the corresponding value holds the ConformerRow instances that match the search criteria.

Return type:

Dict[str, ConformerRow]

get_physical_properties(identifier: str | List[str] | None = None, compound_id: int | List[int] | None = None, source: str = 'PhysicalProperty') List[PhysicalPropertyRow] | List[PropPredRow]

Retrieves physical properties in the database by matching CAS number, chemical identifier, name or compound ID.

Parameters:
  • identifier (str or List[str], optional) – CAS number, chemical identifier or compound name as a string or a list of strings. If None, compound_id must be provided.

  • compound_id (int or List[int], optional) – Compound ID as an integer or a list of integers. If None, identifier must be provided.

  • source (str, optional) – Source of the properties.

      - ‘PhysicalProperty’ (default): Returns properties from the PhysicalProperty table.
      - ‘PropPred’: Returns estimated properties from the PropPred table.

Returns:

A list of PhysicalPropertyRow or PropPredRow instances, depending on the source.

Return type:

List[PhysicalPropertyRow] or List[PropPredRow]

modify_attribute_by_compound_id(attribute: str, value: str | int, compound_id: int)

Modifies the value of a specified attribute for a given compound ID.

Parameters:
  • attribute (str) – Attribute to modify. It can be one of the following: ‘name’, ‘cas’, ‘identifier’, ‘smiles’, ‘nring’.

  • value (str or int) – the new value of the specified attribute.

  • compound_id (int) – an integer representing the compound ID.

db.modify_attribute_by_compound_id("identifier", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H", 0)
update_compound_by_conformer_id(compound_id: int, conformer_id: int)

Update the data for a compound ID row in the COMPOUND TABLE using the data from a conformer ID row in the CONFORMER TABLE.

Parameters:
  • compound_id (int) – An integer compound ID corresponding to a specific row in the COMPOUND TABLE of the database.

  • conformer_id (int) – An integer conformer ID corresponding to a specific row in the CONFORMER TABLE of the database.

update_compound_by_lowestE(compound_id: int | List[int] | None = None)

Update the data for a compound ID row in the COMPOUND TABLE using the data from a conformer ID row with the lowest energy having the same compound ID in the CONFORMER TABLE.

Keyword Arguments:

compound_id (int or List[int], optional) – Compound ID as an integer or a list of integers. If None, updates all compounds in the database.
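
The selection step behind this method, picking the lowest-energy conformer for each compound ID, can be sketched with plain Python objects. The Conformer class is a hypothetical stand-in for ConformerRow, and ranking by Ecosmo is an assumption; the text above does not state which energy the method compares.

```python
from dataclasses import dataclass


@dataclass
class Conformer:
    """Hypothetical stand-in for ConformerRow with only the fields used here."""
    conformer_id: int
    compound_id: int
    Ecosmo: float  # assumed ranking energy (kcal/mol)


def lowest_energy_conformers(conformers):
    """Return a dict mapping each compound_id to its lowest-energy conformer."""
    best = {}
    for conf in conformers:
        current = best.get(conf.compound_id)
        if current is None or conf.Ecosmo < current.Ecosmo:
            best[conf.compound_id] = conf
    return best


rows = [
    Conformer(conformer_id=1, compound_id=1, Ecosmo=-10.0),
    Conformer(conformer_id=2, compound_id=1, Ecosmo=-12.5),
    Conformer(conformer_id=3, compound_id=2, Ecosmo=-3.0),
]
best = lowest_energy_conformers(rows)
print({cid: conf.conformer_id for cid, conf in best.items()})
```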

visualize_conformers(compound_id: int | None = None, identifier: str | None = None)

Visualizes conformers in ascending order of conformer ID.

Parameters:
  • compound_id (int, optional) – Compound ID for which conformers are visualized.

  • identifier (str, optional) – CAS number, chemical identifier or compound name.

class pyCRS.Database.CompoundRow(compound_id: int, conformer_id: int, name: str, cas: str, identifier: str, smiles: str, resolved_smiles: str, coskf: str, Egas: float, Ecosmo: float, nring: int)

A data class representing the contents of a row in the COMPOUND TABLE of a COSKFDatabase.

compound_id

A unique identifier for a specific row in the COMPOUND TABLE of the database

Type:

int

conformer_id

A unique identifier for a specific row in the CONFORMER TABLE of the database

Type:

int

name

The name associated with the row in the COMPOUND TABLE

Type:

str

cas

The CAS number associated with the row, i.e., the compound

Type:

str

identifier

The chemical identifier associated with the row, i.e., the compound

Type:

str

smiles

The SMILES string provided by the user

Type:

str

resolved_smiles

The derived SMILES string obtained using OpenBabel from the coordinates in the COSKF file.

Type:

str

coskf

The filename of the .coskf file stored in the local SCM_PYCRS_COSKF_DB directory

Type:

str

Egas

The gas phase bond energy rounded to 3 decimal places in kcal/mol

Type:

float

Ecosmo

The bond energy in a perfect conductor rounded to 3 decimal places in kcal/mol

Type:

float

nring

The number of ring atoms

Type:

int

db_path

The path to the .coskf file directory

Type:

str

get_full_coskf_path()

Returns the full path of the corresponding .coskf file

read_coskf()

Opens the .coskf file corresponding to the database entry and returns a scm.plams.KFFile instance

class pyCRS.Database.ConformerRow(conformer_id: int, compound_id: int, name: str, cas: str, identifier: str, smiles: str, resolved_smiles: str, coskf: str, Egas: float, Ecosmo: float, nring: int)

A data class representing the contents of a row in the CONFORMER TABLE of a COSKFDatabase.

conformer_id

A unique identifier for a specific row in the CONFORMER TABLE of the database

Type:

int

compound_id

A unique identifier for a specific row in the COMPOUND TABLE of the database

Type:

int

name

The name associated with the row in the CONFORMER TABLE

Type:

str

cas

The CAS number associated with the row, i.e., the compound

Type:

str

identifier

The chemical identifier associated with the row, i.e., the compound

Type:

str

smiles

The SMILES string provided by the user

Type:

str

resolved_smiles

The derived SMILES string obtained using OpenBabel from the coordinates in the COSKF file

Type:

str

coskf

The filename of the .coskf file stored in the local SCM_PYCRS_COSKF_DB directory

Type:

str

Egas

The gas phase bond energy rounded to 3 decimal places in kcal/mol

Type:

float

Ecosmo

The bond energy in a perfect conductor rounded to 3 decimal places in kcal/mol

Type:

float

nring

The number of ring atoms

Type:

int

db_path

The path to the .coskf file directory

Type:

str

get_full_coskf_path()

Returns the full path of the corresponding .coskf file

read_coskf()

Opens the .coskf file corresponding to the database entry and returns a scm.plams.KFFile instance

class pyCRS.Database.PhysicalPropertyRow(compound_id: int, meltingpoint: float, hfusion: float, cpfusion: float, boilingpoint: float, density: float, flashpoint: float, dielectricconstant: float, vp_equation: str, vp_params: str, tvap: float, pvap: float, Mn: float)

A data class representing the contents of a row in the PhysicalProperty TABLE of a COSKFDatabase.

compound_id

A unique identifier for a specific row in the COMPOUND TABLE of the database

Type:

int

meltingpoint

melting temperature (K)

Type:

float

hfusion

enthalpy of fusion (kcal/mol)

Type:

float

cpfusion

heat capacity of fusion (kcal/mol-K) calculated as the difference between the heat capacity in the liquid state and the heat capacity in the solid state.

Type:

float

boilingpoint

boiling point (K)

Type:

float

density

liquid density (kg/L)

Type:

float

flashpoint

flash point (K)

Type:

float

dielectricconstant

dielectric constant

Type:

float

vp_equation

The vapor pressure equation to use, with pressure in bar. Options include ANTOINE, VPM1 and DIPPR101.

Type:

str

vp_params

Parameters for the vp_equation, expressed as “A, B, C, D, E”

Type:

str

tvap

Temperature (K) at pvap

Type:

float

pvap

Pressure (bar) at tvap

Type:

float

Mn

polymer average molecular weight (g/mol)

Type:

float

Vapor Pressure Equations:
ANTOINE:

log10(P) = A - B/(C+T)

DIPPR101:

ln(P) = A + B/T + C*ln(T) + D*T**E

VPM1:

ln(P) = A/T + B*ln(T) + C*T + D
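
The three equations above can be evaluated with a few lines of Python. This is a sketch rather than the library's code: parse_vp_params and vapor_pressure are hypothetical helper names, and P is taken in bar and T in Kelvin as stated for vp_equation. The usage lines reuse the benzene Antoine parameters from the add_physical_property example.

```python
import math


def parse_vp_params(vp_params):
    """Split a comma-separated parameter string such as "A, B, C" into floats."""
    return [float(p) for p in vp_params.split(",")]


def vapor_pressure(equation, params, T):
    """Evaluate the vapor pressure in bar at temperature T in Kelvin."""
    name = equation.upper()
    if name == "ANTOINE":
        A, B, C = params[:3]
        return 10.0 ** (A - B / (C + T))
    if name == "DIPPR101":
        A, B, C, D, E = params[:5]
        return math.exp(A + B / T + C * math.log(T) + D * T ** E)
    if name == "VPM1":
        A, B, C, D = params[:4]
        return math.exp(A / T + B * math.log(T) + C * T + D)
    raise ValueError(f"unknown vapor pressure equation: {equation}")


# Benzene Antoine parameters from the add_physical_property example.
params = parse_vp_params("4.72583, 1660.652, -1.461")
p_benzene = vapor_pressure("Antoine", params, 353.25)
print(round(p_benzene, 3))  # close to 1 atm near the normal boiling point
```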

class pyCRS.Database.PropPredRow(compound_id: int, adopt_smiles: str, meltingpoint: float, hfusion: float, boilingpoint: float, density: float, flashpoint: float, dielectricconstant: float, vp_equation: str, vp_params: str)

A data class representing the contents of a row in the PropPred TABLE of a COSKFDatabase.

compound_id

A unique identifier for a specific row in the COMPOUND TABLE of the database

Type:

int

adopt_smiles

The SMILES string used for the QSPR method

Type:

str

meltingpoint

melting temperature (K)

Type:

float

hfusion

enthalpy of fusion (kcal/mol)

Type:

float

boilingpoint

boiling point (K)

Type:

float

density

liquid density (kg/L)

Type:

float

flashpoint

flash point (K)

Type:

float

dielectricconstant

dielectric constant

Type:

float

vp_equation

The vapor pressure equation to use, with pressure in bar: VPM1.

Type:

str

vp_params

Parameters for the vp_equation, expressed as “A, B, C, D, E”

Type:

str