Data And Config Files

MACE_SCF uses MACE data loading, but the electrostatic models add extra data keys and training schedule structure. Most training examples split settings between a YAML config file and command-line arguments.

Training Script and Config Files

Data processing and many training settings are specified in a config.yaml file, alongside the call to run_train.py. There are many more options than in MACE, so it is recommended to read the docs before attempting a fit.

An example training script and config.yaml for the Local Symmetric Charge MACE model would be:

train.sh:

python scripts/run_train.py \
    --name="split_charge_example" \
    --config=config.yaml \
    --train_file="$train" \
    --valid_fraction=0.2 \
    --test_file="$train" \
    --E0s=average \
    --model="LocalSplitCharges" \
    --hidden_irreps='64x0e + 64x1o' \
    --r_max=6.0 \
    --eval_interval=2 \
    --batch_size=8 \
    --valid_batch_size=2 \
    --error_table="PerAtomRMSE" \
    --amsgrad \
    --device=cuda \
    --atomic_multipoles_max_l=1 \
    --atomic_multipoles_smearing_width=1.5 \
    --atomic_formal_charges="{13: 0.0}" \
    --default_dtype="float64" \
    --restart_latest \
    --save_cpu

config.yaml:

heads:
  revPBEd3:
    info_keys:
      energy: AIMS_energy
      dipole: AIMS_dipole
      polarizability: AIMS_polarizability
      fermi_level: VBM
      external_field: homogeneous_field
      total_charge: total_charge
    arrays_keys:
      forces: AIMS_forces
      atomic_multipoles: AIMS_atom_multipoles
train_schedule:
  0:
    name: stage1
    start: 0
    end: 99
    loss:
      atomic_multipoles: 100.0
      dipole_per_atom: 1.0
      energy_per_atom: 1.0
      forces: 100.0
    lr: 0.01
  1:
    name: stage2
    start: 100
    end: 199
    loss:
      atomic_multipoles: 100.0
      dipole_per_atom: 1.0
      energy_per_atom: 1000.0
      forces: 100.0
    lr: 0.001

train.sh contains many new options. The most important ones relate to handling periodicity and to specifying the description of the charge density in the model (atomic multipole order, smearing width, etc.). You can find more information in atomic_multipoles and boundary_conditions.

The config.yaml mostly covers the data read from the input xyz files, the loss weights, and the training schedule. It is explained below.

Data Keys

The keys used to read data from train/test xyz files must be specified in the heads dictionary.

The info_keys subsection refers to anything in atoms.info.

heads:
  LDA:
    info_keys:
      energy: AIMS_energy
      dipole: AIMS_dipole
      polarizability: AIMS_polarizability
      fermi_level: dft_fermi_level
      external_field: homogeneous_field
      total_charge: total_charge

and arrays_keys is anything in atoms.arrays:

    arrays_keys:
      forces: AIMS_forces
      atomic_multipoles: AIMS_atom_multipoles

You don’t need to specify everything: if you are only training on energy and forces, just omit the other fields.
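As a rough pre-flight check, the key mapping above can be compared against what a configuration actually carries. The sketch below is not part of MACE_SCF: missing_keys is a hypothetical helper, and plain dicts stand in for atoms.info and atoms.arrays of a single configuration.

```python
# Hypothetical helper (not a MACE_SCF function): list the properties whose
# configured key is absent from one configuration.
def missing_keys(head_cfg, info, arrays):
    """Return property names whose mapped key is not found in info/arrays."""
    missing = []
    for prop, key in head_cfg.get("info_keys", {}).items():
        if key not in info:
            missing.append(prop)
    for prop, key in head_cfg.get("arrays_keys", {}).items():
        if key not in arrays:
            missing.append(prop)
    return missing


# A trimmed-down version of one entry of the heads dictionary.
head_cfg = {
    "info_keys": {"energy": "AIMS_energy", "dipole": "AIMS_dipole"},
    "arrays_keys": {"forces": "AIMS_forces"},
}

# Plain dicts standing in for atoms.info and atoms.arrays (illustrative values).
info = {"AIMS_energy": -1.23}
arrays = {"AIMS_forces": [[0.0, 0.0, 0.0]]}

print(missing_keys(head_cfg, info, arrays))
```

Here the configuration has no AIMS_dipole entry, so dipole is reported as missing. Whether a missing property is a problem depends on what you intend to train on.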

Check if your data is loaded correctly

At the start of each training run, a line like this is printed in the log file:
INFO: Total Training set [energy: 219, dipole components: 219, fermi_level: 219, total_charge: 219, external_field: 219, stress: 0, head: 219, forces: 219, atomic_multipoles: 219]
Always check that these numbers agree with what you intended to load from the xyz file.
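If you want to automate this check, the summary line can be parsed with the standard library. This is a minimal sketch that assumes the log line keeps the bracketed "name: count" layout shown above. parse_dataset_counts is a hypothetical helper, not something MACE_SCF provides.

```python
import re


def parse_dataset_counts(line):
    """Extract {property: count} from a 'Total ... set [...]' log line."""
    match = re.search(r"\[(.*)\]", line)
    if match is None:
        raise ValueError("no bracketed summary found in line")
    counts = {}
    for item in match.group(1).split(","):
        # Split on the last colon so names containing spaces are kept whole.
        name, _, value = item.rpartition(":")
        counts[name.strip()] = int(value)
    return counts


line = (
    "INFO: Total Training set [energy: 219, dipole components: 219, "
    "fermi_level: 219, total_charge: 219, external_field: 219, stress: 0, "
    "head: 219, forces: 219, atomic_multipoles: 219]"
)
counts = parse_dataset_counts(line)

# Properties that were not found in any configuration.
empty = [name for name, n in counts.items() if n == 0]
print(empty)
```

For the line above, only stress has a zero count, which is expected if the dataset carries no stress labels.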

Config Dipole Weight

MACE-style extxyz files can carry per-configuration weights using keys of the form config_<property>_weight. MACE_SCF uses this for component-wise dipole masking.

If a config contains the dipole key specified in heads.info_keys.dipole, it must also contain an explicit config_dipole_weight.

config_dipole_weight must be a 3-vector in Cartesian component order. It controls which dipole components contribute to dipole losses and error metrics. You need to set this differently depending on what the dipole physically means in the dataset:

for molecular clusters where all dipole components are meaningful, config_dipole_weight="1.0 1.0 1.0"
for slabs where only the surface-normal dipole is meaningful, config_dipole_weight="0.0 0.0 1.0"

If you have periodic configurations, the dipole is generally not a valid quantity unless it is computed via the modern theory of polarization, with care taken to handle branch cuts. To ignore the dipole on such a config, set config_dipole_weight="0.0 0.0 0.0" (or a partial mask). Configs that do not have the dipole info key at all are automatically given a zero dipole weight by the data loader and so do not contribute to dipole losses or metrics.
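To illustrate what component-wise masking means, the sketch below computes a dipole error in which each Cartesian component is weighted by config_dipole_weight. It is only an illustration of the idea: the function name and the normalization by the sum of weights are assumptions, and the actual loss and metric definitions in MACE_SCF may differ.

```python
import math


def masked_dipole_rmse(predicted, reference, weight):
    """RMSE over dipole components, weighted component-wise by `weight`."""
    total = 0.0
    norm = 0.0
    for p, r, w in zip(predicted, reference, weight):
        total += w * (p - r) ** 2
        norm += w
    if norm == 0.0:
        # Fully masked configuration: no dipole component contributes.
        return 0.0
    return math.sqrt(total / norm)


# Parse the weight the way it is written in the extxyz comment line.
slab_weight = [float(x) for x in "0.0 0.0 1.0".split()]

predicted = [0.5, -0.2, 1.0]   # illustrative values only
reference = [0.0, 0.0, 0.7]

# Only the z component (surface normal) enters the error for the slab mask.
print(masked_dipole_rmse(predicted, reference, slab_weight))
# With a zero mask the configuration is ignored entirely.
print(masked_dipole_rmse(predicted, reference, [0.0, 0.0, 0.0]))
```

With the slab mask, the large in-plane discrepancies in x and y have no effect on the result, which is the intended behaviour when those components are not physically meaningful.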

train_schedule

train_schedule defines one or more training stages. Each stage can set:

name
start and end epochs
loss
lr
more detailed options for SCF models (e.g. fixed_point_training_options) when relevant

For example:

train_schedule:
  0:
    name: stage1
    start: 0
    end: 99
    loss:
      atomic_multipoles: 100.0
      energy_per_atom: 10.0
      forces: 100.0
    lr: 0.01
  1:
    name: stage2
    start: 100
    end: 200
    loss:
      energy_per_atom: 1000.0
      forces: 100.0
    lr: 0.01

You can build whatever loss you want and change it between stages. You can even add or remove components of the loss, as the example above does by dropping atomic_multipoles in stage2. The loss can be composed from the available loss functions.
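Before launching a long run, it can help to sanity-check the stage boundaries. The sketch below works on the schedule as a plain Python dict (as you would get after loading config.yaml) and assumes that end is inclusive, which is suggested by the 0 to 99 and 100 to 199 ranges in the first example but is not confirmed here. Both helpers are hypothetical and not part of MACE_SCF.

```python
def stage_for_epoch(schedule, epoch):
    """Return the name of the stage covering `epoch` (assumes inclusive `end`)."""
    for stage in schedule.values():
        if stage["start"] <= epoch <= stage["end"]:
            return stage["name"]
    return None


def check_schedule(schedule):
    """Return a list of problems: inverted ranges, overlaps, or gaps."""
    problems = []
    stages = sorted(schedule.values(), key=lambda s: s["start"])
    for stage in stages:
        if stage["start"] > stage["end"]:
            problems.append(f"{stage['name']}: start after end")
    for prev, nxt in zip(stages, stages[1:]):
        if nxt["start"] <= prev["end"]:
            problems.append(f"{prev['name']} overlaps {nxt['name']}")
        elif nxt["start"] > prev["end"] + 1:
            problems.append(f"gap between {prev['name']} and {nxt['name']}")
    return problems


# The two-stage schedule from the example, written as a dict.
schedule = {
    0: {"name": "stage1", "start": 0, "end": 99, "lr": 0.01,
        "loss": {"atomic_multipoles": 100.0, "energy_per_atom": 10.0, "forces": 100.0}},
    1: {"name": "stage2", "start": 100, "end": 200, "lr": 0.01,
        "loss": {"energy_per_atom": 1000.0, "forces": 100.0}},
}

print(check_schedule(schedule))
print(stage_for_epoch(schedule, 99), stage_for_epoch(schedule, 100))
```

An empty problem list means the stages are ordered, non-overlapping, and contiguous under the inclusive-end assumption.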

The training schedule generalizes the _stagetwo syntax in MACE.