Train an eCE model

Training an eCE model requires two steps:

  1. Constructing a database: the database contains the configurations and the associated property.

  2. Fitting the model: the model is trained by optimizing a loss function using the PyTorch library.

Construct a database

To construct a database, each structure must be mapped onto the ideal lattice and the data processed into an object that the eCE model can read. The data command of the ece module constructs a DataBase object from files containing the structure geometries and the associated properties of interest. Arguments must be given directly via the command line interface. The accepted arguments are summarized in the table below.

| Argument | Shortcut | Type | Help | Default | Notes |
|---|---|---|---|---|---|
| model | m | str | Path to the model. | model.pth | |
| data | d | list[str] | Path to the file containing the structure, its associated property, and its weight during fitting (optional, default: 1). | | Use either batch or data. This argument can be used several times if there is more than one data entry. |
| batch | b | str | Path to a text file in which every line contains the path to the file containing the structure, its associated property, and its weight during fitting (optional, default: 1). | | Use either batch or data. |
| max_def | | float | Maximum deformation allowed in the structure. | 0.2 | Helps filter out data that relaxed to another lattice. |
| max_disp | | float | Maximum atomic displacement allowed in the structure, in Angstroms. | 0.5 | Helps filter out data that relaxed to another lattice. |
| max_mapping_cost | | float | Critical value of the total mapping cost below which a mapping is accepted. | 0.5 | Helps filter out data that relaxed to another lattice. |
| lattice_weight | | float | Weighting factor ([0,1]) for the lattice mapping cost in the total cost calculation. | 0.5 | |
| max_vac | | float | Maximum vacancy concentration allowed in the structure. | 0.0 | Indicates the vacancy concentration range. |
| min_vac | | float | Minimum vacancy concentration allowed in the structure. | 0.0 | Indicates the vacancy concentration range. |
| name | n | str | Path to save the processed data in the pickle database format. | database.pkl | |
| verbose | v | store_true | Print information about the processed data. | False | |

Tip

The function can be used in the command line interface through the ece module. Internally, the module calls the process_data() function. The latter also accepts a dictionary with the same arguments as keys and can be called in a Python script.

data Arguments

model

It specifies the path to the model, and in particular the underlying lattice found in its Prim object, so that all configurations are mapped onto the same ideal lattice. PyeCE eCE models are saved either in the default PyTorch format (typically with a .pt or .pth extension) or, if the ECI model is not built using the from_settings building attribute, as a gzip-compressed tarfile (with a .tar.gz extension).

data

It contains information about the data to be loaded in the database and expects 2 (3) input arguments as:

-d path_to_structure property (weight)
  1. The path_to_structure indicates where to find the file containing the structure geometry. Generally, this file corresponds to a VASP POSCAR. However, any file that can be read by Pymatgen is accepted.

  2. The property corresponds to the configuration-dependent property of interest that is intended to be inferred. The property is expected to be an extensive quantity. In cluster expansion, this property is typically the energy or the formation energy.

  3. The weight is a positive value that is used during training to attach more importance to some data compared to others. The weights are automatically normalized during training. It is thus not necessary to normalize them. The weights are optional and their default values are 1.

This argument can be used several times, once for each data entry to be included in the database.

batch

The argument refers to the filename of a file containing information about a batch of data to be loaded in the database. The file is expected to have the following format:

path_to_structure_1 property_1 weight_1
path_to_structure_2 property_2 weight_2
path_to_structure_3 property_3 weight_3
...                 ...        ...

Every line contains the information related to one data entry, as explained above for the data argument. If weights are specified for only some data, the first line must still contain at least 2 delimiters to indicate that 3 columns are present.
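As an illustration, a batch file in this format could be generated with a few lines of Python. This is only a minimal sketch: the `write_batch_file` helper, the structure paths, and the property values are hypothetical and not part of the ece module, and a single space is assumed to be an accepted column delimiter.

```python
from pathlib import Path
import tempfile


def write_batch_file(path, entries):
    """Write one 'path property weight' line per entry.

    The space delimiter is an assumption made for this sketch.
    """
    lines = [f"{structure} {prop} {weight}" for structure, prop, weight in entries]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


# Hypothetical structure files and energies; every line carries an explicit weight
entries = [
    ("structures/POSCAR_1", -12.345, 1.0),
    ("structures/POSCAR_2", -11.875, 1.0),
    ("structures/POSCAR_3", -12.125, 2.0),  # given twice the importance during fitting
]

batch_path = Path(tempfile.mkdtemp()) / "batch.txt"
batch_lines = write_batch_file(batch_path, entries)
```

Writing all three columns on every line sidesteps the requirement on the first line, since the number of columns is then unambiguous.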

max_def

The algorithm used to map the structures onto the lattice is fast but relies on the assumption that the deformations of the lattice are small. Under large deformations, its behavior is not well defined. To prevent unpredictable behavior, it is recommended to filter out structures whose deformation exceeds the threshold value set by this argument.

Given an ideal lattice \(L\) and the lattice of the structure to be mapped \(L'\), the two lattices are related by a deformation \(F\) such that \(L' = F L\). The structure is accepted if:

\[\lambda_i - 1 < \text{max_def} \quad \forall \lambda_i \in \Lambda \quad \text{ such that } \quad F = L' L^{-1} = Q \Lambda Q^{-1}\]
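The acceptance test above can be sketched with NumPy. This is not the PyeCE implementation: the `passes_max_def` helper is hypothetical, the lattice vectors are assumed to be stored as matrix columns, and \(F\) is taken as \(L' L^{-1}\) as written in the criterion.

```python
import numpy as np


def passes_max_def(ideal_lattice, structure_lattice, max_def=0.2):
    """Return True if every eigenvalue of F satisfies lambda_i - 1 < max_def."""
    # F = L' L^{-1}, following the acceptance criterion
    deformation = structure_lattice @ np.linalg.inv(ideal_lattice)
    eigenvalues = np.linalg.eigvals(deformation)
    # eigvals may return complex numbers; only the real part is compared here
    return bool(np.all(eigenvalues.real - 1.0 < max_def))


# Hypothetical cubic cell of edge 4 (arbitrary length unit)
ideal = 4.0 * np.eye(3)
slightly_strained = ideal @ np.diag([1.05, 1.00, 0.98])
strongly_strained = ideal @ np.diag([1.30, 1.00, 1.00])

ok_small = passes_max_def(ideal, slightly_strained)   # largest stretch is 5%
ok_large = passes_max_def(ideal, strongly_strained)   # one axis stretched by 30%
```

With the default threshold of 0.2, the first cell is kept and the second one is filtered out.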

max_disp

As for the max_def argument, the algorithm used to map the structures onto the lattice is fast but relies on the assumption that the atomic displacements are small. Under large atomic displacements, its behavior is not well defined. To prevent unpredictable behavior, it is recommended to filter out structures whose atomic displacements exceed the threshold value set by this argument.

Given that both the ideal structure and the structure to be mapped have the same lattice, the atomic displacement of the ith atom is \(\delta_i = |\vec{s'_i} - \vec{s_i}|\), where \(\vec{s_i}\) and \(\vec{s'_i}\) are the atomic positions in the ideal structure and in the structure to be mapped, respectively. The structure is accepted if:

\[\delta_i < \text{max_disp} \quad \forall i \in \text{structure}\]
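A minimal sketch of this displacement filter is given below. It assumes Cartesian positions in Angstroms that are already expressed in the same lattice and ignores periodic images; the helper names and coordinates are hypothetical.

```python
import math


def atomic_displacements(ideal_positions, mapped_positions):
    """Euclidean distance between each ideal site and its mapped atom."""
    return [math.dist(s, s_prime) for s, s_prime in zip(ideal_positions, mapped_positions)]


def passes_max_disp(ideal_positions, mapped_positions, max_disp=0.5):
    """Return True if every atomic displacement is smaller than max_disp."""
    return all(d < max_disp for d in atomic_displacements(ideal_positions, mapped_positions))


# Hypothetical two-atom structure
ideal = [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)]
relaxed = [(0.1, 0.0, 0.0), (2.0, 2.3, 2.0)]   # displacements of about 0.1 and 0.3
hopped = [(0.1, 0.0, 0.0), (2.0, 2.0, 2.8)]    # second atom moved by about 0.8

relaxed_ok = passes_max_disp(ideal, relaxed)
hopped_ok = passes_max_disp(ideal, hopped)
```

A single atom exceeding the threshold is enough to reject the whole structure, as in the second case.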

max_mapping_cost

A mapping score is computed as in Comparing crystal structures with symmetry and geometry. The mapping cost consists of the contributions of two costs:

  1. Lattice cost: cost based on the deformation matrix to map the child and parent lattices.

  2. Basis cost: cost based on the displacement field to map the child and parent atomic sites.

The mapping cost is computed as a weighted average of those two costs as:

\[\text{mapping_cost} = \text{lattice_weight} \times \text{lattice_cost} + (1-\text{lattice_weight}) \times \text{basis_cost}\]

This parameter sets the maximum value allowed for the mapping cost. If the mapping cost exceeds that critical value, the mapping is rejected.
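The weighted average and the acceptance rule can be written out directly. The function names below are illustrative and the cost values are made up; only the formula itself comes from the text above.

```python
def mapping_cost(lattice_cost, basis_cost, lattice_weight=0.5):
    """Weighted average of the lattice and basis costs."""
    if not 0.0 <= lattice_weight <= 1.0:
        raise ValueError("lattice_weight must lie in [0, 1]")
    return lattice_weight * lattice_cost + (1.0 - lattice_weight) * basis_cost


def accept_mapping(lattice_cost, basis_cost, lattice_weight=0.5, max_mapping_cost=0.5):
    """A mapping is rejected when its cost exceeds max_mapping_cost."""
    return mapping_cost(lattice_cost, basis_cost, lattice_weight) <= max_mapping_cost


# Made-up costs: a small lattice cost and a larger basis cost
balanced = mapping_cost(0.2, 0.6, lattice_weight=0.5)     # about 0.4
basis_only = mapping_cost(0.2, 0.6, lattice_weight=0.0)   # 0.6, basis cost alone
```

The same pair of costs is accepted with equal weighting but rejected when only the basis cost counts, which shows how lattice_weight shifts the filter.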

lattice_weight

It specifies the weight, between 0 and 1, given to the lattice cost when computing the mapping cost; the basis cost is weighted by 1 - lattice_weight.

max_vac

It indicates the maximum vacancy concentration that is allowed in the structure. Vacancies (denoted by Va) must be allowed in the Prim object in case vacancies are expected.

min_vac

It indicates the minimum vacancy concentration that is allowed in the structure. Vacancies (denoted by Va) must be allowed in the Prim object in case vacancies are expected.

name

It specifies the path to save the database in a pickle dataframe format.

verbose

Set it to print information during the mapping of the data.

data Output

As an output, the data functionality creates a DataBase database and saves it in the pickle format. The data frame has the following entries for each data entry it contains:

  1. The composition: The composition for each species present in the system. Each species has its own column named x_Xx, with Xx being the symbol of the species.

  2. The number of unit cells: The number of unit cells present in the configuration is given in the column named scel_size.

  3. The number of atoms: The number of atoms present in the configuration is given in the column named n_atoms.

  4. The property: The value of the property associated with the configuration is given in the column named property.

  5. The weight: The value of the weight associated with the configuration during training is given in the column named weight.

  6. The lattice mapping cost: The lattice cost to map the lattice of the input structure to the ideal structure is given in the column named lattice_cost.

  7. The atomic mapping cost: The atomic cost to map the atomic coordinates of the input structure to the ideal structure is given in the column named atomic_cost.

  8. The total mapping cost: The total cost to map the input structure to the ideal structure is given in the column named mapping_cost.

  9. The ideal structure: The mapped ideal structure, as a string in the VASP POSCAR format, is given in the column named structure.

  10. The input structure: The original input structure, as a string in the VASP POSCAR format, is given in the column named input_structure.

  11. (Optional) The ConfigData: The ConfigData object, in the form of a dictionary, is given in the column named ConfigData.

Fit an eCE model

The fitting process of an eCE model relies on the minimization of a loss function by means of iterative methods such as the Stochastic Gradient Descent (SGD) algorithm. The fit command of the ece module trains an eCE model from a database. Arguments can be given directly via the command line interface or via a YAML settings file. When an argument is given both via the command line interface and via the YAML settings file, the value in the YAML settings file takes priority. The accepted arguments are summarized in the table below.

| Argument | Shortcut | Type | Help | Default | Notes |
|---|---|---|---|---|---|
| model | m | str | Path to the model to fit. | model.pth | |
| device | | str | Device to be used for PyTorch operations. | cpu | |
| cpu | | store_true | Use cpu device for PyTorch operations. | | Shortcut for --device cpu. |
| cuda | | store_true | Use cuda device for PyTorch operations. | | Shortcut for --device cuda. |
| train_file | t | str | Path to the training dataset in pickle dataframe format of a DataBase object. | | If used, validation and test sets can be added through the --valid_file and --test_files arguments. |
| valid_file | | str | Path to the validation dataset in pickle dataframe format of a DataBase object. | | To be used together with the --train_file argument. |
| test_files | | str | Paths to the test datasets in pickle dataframe format (allows for several test datasets) of DataBase objects. | | To be used together with the --train_file argument. |
| data_file | d | str | Path to the dataset in pickle dataframe format of a DataBase object containing all data available for training. | | If used, it assumes random splitting of the dataset into training, validation, and test sets through the --train_prop, --valid_prop, and --test_prop arguments. |
| train_prop | | float | Proportion of data in the training dataset. | 1.0 | To be used with the --data_file argument. |
| valid_prop | | float | Proportion of data in the validation dataset. | | To be used with the --data_file argument. |
| test_prop | | float | Proportion of data in the test dataset (allows for several test datasets). | | To be used with the --data_file argument. |
| train_indexes | | str | List of string slices 'start:end:step' (colon-separated) of indexes of data to include in the training dataset (default: start=0, end=len(dataset), step=1). | | To be used with the --data_file and --train_prop arguments. |
| valid_indexes | | str | List of string slices 'start:end:step' (colon-separated) of indexes of data to include in the validation dataset (default: start=0, end=len(dataset), step=1). | | To be used with the --data_file and --valid_prop arguments. |
| test_indexes | | str | List of string slices 'start:end:step' (colon-separated) of indexes of data to include in the test dataset (default: start=0, end=len(dataset), step=1). | | To be used with the --data_file and --test_prop arguments. Repeating the argument results in several test datasets. |
| batch_size | | int | Batch size for each stochastic training step. | 50 | To be used with the --data_file and --train_prop arguments. |
| ladder_steps | l | int | List with the number of orbits to include in each ladder training step. | | If not set, one single ladder step including all orbits is performed (default). |
| epochs | e | int | List of the number of epochs for each ladder step. | [50] | If only one value is given, the same number of epochs is applied to all steps. |
| lr | r | float | List of the learning rates for each ladder step. | | If only one value is given, the same learning rate is applied to all steps. |
| lr_embedding | | float | List of the learning rates for the embedding matrix for each ladder step. | | If only one value is given, the same learning rate is applied to all steps. If not specified, the same value as for --lr is assumed. |
| weight_decay | w | float | List of the weight decays / L2 regularizations for each ladder step. | | If only one value is given, the same weight decay is applied to all steps. |
| optimizers | o | str | List of the optimizers for each ladder step, where each optimizer is written as a string dictionary. | ['{"Adam": {"amsgrad": true}}'] | If only one value is given, the same optimizer is applied to all steps. |
| schedulers | c | str | List of the schedulers for each ladder step, where each scheduler is written as a string dictionary. | ['{"ConstantLR": {"factor": 1, "total_iters": 0}}'] | If only one value is given, the same scheduler is applied to all steps. The default scheduler results in no change in the learning rate with the number of epochs. |
| earlystopping_patience | p | int | List of the early-stopping patience values for each ladder step. | | If no value is set, early stopping is not applied. If only one value is set, the same early stopping is applied to all steps. It requires a validation dataset. |
| loss_fct | f | str | Name of the loss function from PyTorch loss functions. | MSELoss | |
| settings | s | str | Path to a YAML file containing the settings to fit the eCE model. | | |
| name | n | str | Path to save the parameterized model. | model.pth | |
| log | g | store_true | Set it in order to save a log file with the indexes of the data split and the errors evaluated along the ladder training. | False | The generated log file is saved in the same folder as the model (--name argument) with the log_fit.json basename. |
| verbose | v | store_true | Print information about the model. | False | |

Tip

The function can be used in the command line interface through the ece module. Internally, the module calls the fit_model() function. The latter also accepts a dictionary with the same arguments as keys and can be called in a Python script.

fit Arguments

model

It specifies the path to the model to train. PyeCE eCE models are saved either in the default PyTorch format (typically with a .pt or .pth extension) or, if the ECI model is not built using the from_settings building attribute, as a gzip-compressed tarfile (with a .tar.gz extension).

device

It indicates the PyTorch device to be used during PyTorch operations.

cpu

This argument indicates that the CPU device must be used during PyTorch operations. This is a shortcut to the --device cpu argument.

cuda

This argument indicates that the CUDA device must be used during PyTorch operations. This is a shortcut to the --device cuda argument.

train_file

It specifies the filename of the database containing the training data. All data included in this database are used in the training stage. When this argument is used, it should be used with the valid_file and test_files arguments. It is more convenient to use this method if data need to be split in a controlled fashion.

valid_file

It specifies the filename of the database containing the validation data. All data included in this database are used in the validation stage. This argument should be used with the train_file argument.

test_files

It specifies (possibly several) filenames of the databases containing the test data. All data included in these databases are used in the testing stage. This argument should be used with the train_file argument.

data_file

It specifies the filename of the database containing all data used during the training process. This database is then randomly split into training, validation, and test datasets. When this argument is used, it is intended to be used together with the train_prop, valid_prop, test_prop, train_indexes, valid_indexes, and test_indexes arguments. This method is more convenient if the data do not need to be split in a controlled fashion and/or if several fits are performed with different splits.

train_prop

It indicates the proportion of data that should be included in the training dataset. Proportions do not need to be normalized.

valid_prop

It indicates the proportion of data that should be included in the validation dataset. Proportions do not need to be normalized.

test_prop

It indicates the (possibly several) proportions of data that should be included in the test datasets. Proportions do not need to be normalized.
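Since the proportions need not be normalized, they can be thought of as relative shares. The sketch below shows one way such shares could be turned into dataset sizes; the `split_sizes` helper and its rounding rule (any remainder goes to the training set) are assumptions made for illustration and may differ from what PyeCE does internally.

```python
def split_sizes(n_data, train_prop=1.0, valid_prop=0.0, test_props=()):
    """Normalize the proportions and convert them into numbers of data.

    Returns (fractions, sizes), ordered as train, valid, test_1, test_2, ...
    """
    props = [train_prop, valid_prop, *test_props]
    total = sum(props)
    if total <= 0:
        raise ValueError("at least one proportion must be positive")
    fractions = [p / total for p in props]
    sizes = [round(f * n_data) for f in fractions]
    # Assumption of this sketch: rounding leftovers are absorbed by the training set
    sizes[0] += n_data - sum(sizes)
    return fractions, sizes


# Proportions given as 6 : 1 : 3 are equivalent to 0.6 : 0.1 : 0.3
fractions, sizes = split_sizes(200, train_prop=6, valid_prop=1, test_props=[3])
```

For a database of 200 data, this yields 120 training, 20 validation, and 60 test data.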

train_indexes

It specifies the indexes of the data in the database that should be included in the training dataset. When this argument is set together with train_prop, these data are included after the random split. The argument is expected to be a list of string slices start:end:step, where start is the first index, end is the stop index (excluded, as in Python slicing), and step is the step size. If only one value is given, it is treated as a single index. If two values are given, the step is set to 1 by default. The default start and end are 0 and the length of the dataset, respectively.

As an example, given a database with 50 data, the command:

--train_indexes :5 10 20:30:2 40:-3

yields the indexes [0, 1, 2, 3, 4, 10, 20, 22, 24, 26, 28, 40, 41, 42, 43, 44, 45, 46].
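The slice syntax follows Python slicing, so the example above can be reproduced with a short parser. The `parse_index_slices` function is a hypothetical helper written for this illustration, not the parser used by the ece module.

```python
def parse_index_slices(slice_strings, n_data):
    """Expand 'start:end:step' strings into a flat list of indexes."""
    all_indexes = range(n_data)
    selected = []
    for text in slice_strings:
        parts = text.split(":")
        if len(parts) > 3:
            raise ValueError(f"too many ':' in slice {text!r}")
        if len(parts) == 1:
            # A value without colon is a single index
            selected.append(all_indexes[int(parts[0])])
            continue
        # Pad missing fields so that 'a:b' behaves like 'a:b:' (step of 1)
        parts += [""] * (3 - len(parts))
        start, end, step = (int(p) if p else None for p in parts)
        selected.extend(all_indexes[slice(start, end, step)])
    return selected


# The example from the text, for a database with 50 data
indexes = parse_index_slices([":5", "10", "20:30:2", "40:-3"], n_data=50)
```

Negative values count from the end of the dataset, which is why 40:-3 stops at index 46 for 50 data.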

valid_indexes

It specifies the indexes of the data in the database that should be included in the validation dataset. When this argument is set together with valid_prop, these data are included after the random split. The argument is expected to be a list of string slices start:end:step, where start is the first index, end is the stop index (excluded, as in Python slicing), and step is the step size. If only one value is given, it is treated as a single index. If two values are given, the step is set to 1 by default. The default start and end are 0 and the length of the dataset, respectively.

As an example, given a database with 50 data, the command:

--valid_indexes :5 10 20:30:2 40:-3

yields the indexes [0, 1, 2, 3, 4, 10, 20, 22, 24, 26, 28, 40, 41, 42, 43, 44, 45, 46].

test_indexes

It specifies the indexes of the data in the database that should be included in the test datasets. When this argument is set together with test_prop, these data are included after the random split. The argument is expected to be a list of string slices start:end:step, where start is the first index, end is the stop index (excluded, as in Python slicing), and step is the step size. If only one value is given, it is treated as a single index. If two values are given, the step is set to 1 by default. The default start and end are 0 and the length of the dataset, respectively. The argument can be used several times to create several test datasets if needed.

As an example, given a database with 50 data, the command:

--test_indexes :5 10 20:30:2 40:-3 --test_indexes 20:30:2

yields the indexes [0, 1, 2, 3, 4, 10, 20, 22, 24, 26, 28, 40, 41, 42, 43, 44, 45, 46] for the first test dataset and [20, 22, 24, 26, 28] for the second one.

batch_size

To reduce overfitting and memory load, the training data are not all used at once during an optimization step; instead, each epoch iterates over random "minibatches" of data. This argument sets the size of these minibatches. The training is performed using the PyTorch DataLoader object (see Preparing your data for training with DataLoaders).
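To make the idea concrete, the sketch below mimics shuffled minibatching in plain Python. It is only an illustration of the concept, not the PyTorch DataLoader; the `iterate_minibatches` helper and the dataset size are hypothetical.

```python
import random


def iterate_minibatches(n_data, batch_size=50, seed=None):
    """Yield lists of data indexes, visiting every datum once per epoch in random order."""
    rng = random.Random(seed)
    order = list(range(n_data))
    rng.shuffle(order)
    for start in range(0, n_data, batch_size):
        # The last minibatch is smaller when n_data is not a multiple of batch_size
        yield order[start:start + batch_size]


# One epoch over a hypothetical training set of 120 data
batches = list(iterate_minibatches(n_data=120, batch_size=50, seed=0))
batch_lengths = [len(batch) for batch in batches]
```

With 120 data and a batch size of 50, one epoch consists of three optimization steps on 50, 50, and 20 data.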

ladder_steps

This argument defines the number of leading orbits to include in each ladder training step. If no value is given, a single fit with all orbits at once is performed. Training an eCE model can be challenging, as the optimization can easily get stuck in local minima. One way to help avoid such poor minima is ladder training: an iterative training that starts with a small set of orbits and gradually increases the number of orbits until a fit is performed with all the orbits specified when building the model (see Construction of an eCE model). By default, the orbits are sorted by body order (number of sites in a cluster) and then by maximum length of the cluster. The strategy is best described as follows:

  1. Perform a first fit on a set of small, low-body-order, and compact orbits.

  2. Incorporate a few more orbits by increasing their size and body-order and perform a new fit starting from the previous fit.

  3. Repeat the previous point (2.) until all orbits are included.

Below is an example of a ladder training of an eCE containing the first nearest neighbor pair (1NNP), the second nearest neighbor pair (2NNP), the third nearest neighbor pair (3NNP), the fourth nearest neighbor pair (4NNP), and the first nearest neighbor triplet clusters (1NNT). Running the command:

--ladder_steps 2 4 5

results in training the model with the 1NNP and 2NNP orbits in the first ladder step, with the 1NNP, 2NNP, 3NNP, and 4NNP orbits in the second ladder step, and finally with all orbits (1NNP, 2NNP, 3NNP, 4NNP, and 1NNT) in the last ladder step, as pictured in the figure below.

_images/ladder_training_illustration.png

Illustration of a 3-step ladder training where the 1NNP and 2NNP orbits are included first, the 3NNP and 4NNP orbits are added next, and all orbits are finally included in the fitting process.
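The selection of orbits at each ladder step amounts to taking growing prefixes of the sorted orbit list. The sketch below reproduces the example above; the `ladder_orbit_sets` helper is hypothetical and the orbits are represented by their labels only.

```python
def ladder_orbit_sets(orbits, ladder_steps=None):
    """Return the orbits included at each ladder step.

    Without ladder_steps, a single step with all orbits is performed.
    """
    if not ladder_steps:
        return [list(orbits)]
    return [list(orbits[:n_orbits]) for n_orbits in ladder_steps]


# Orbits sorted by body order, then by cluster length, as in the example
orbits = ["1NNP", "2NNP", "3NNP", "4NNP", "1NNT"]
steps = ladder_orbit_sets(orbits, ladder_steps=[2, 4, 5])
```

Each step starts from the parameters obtained in the previous one, so the last value of ladder_steps should equal the total number of orbits if the final fit is meant to include all of them.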

epochs

It specifies the number of epochs to perform for each ladder step. If only one value is given, the same number of epochs is used for all ladder steps.
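The rule that a single value applies to every ladder step, which also holds for several of the arguments described below, can be sketched as follows. The `per_step_values` helper is hypothetical, and raising an error on a length mismatch is an assumption of this sketch.

```python
def per_step_values(values, n_steps):
    """Expand a per-ladder-step setting to one value per step."""
    values = list(values) if isinstance(values, (list, tuple)) else [values]
    if len(values) == 1:
        # A single value is reused for every ladder step
        return values * n_steps
    if len(values) != n_steps:
        raise ValueError(f"expected 1 or {n_steps} values, got {len(values)}")
    return values


same_everywhere = per_step_values(50, n_steps=3)
one_per_step = per_step_values([100, 40, 40], n_steps=3)
```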

lr

It specifies the initial learning rate to use during training for each ladder step (see torch.optim for more information on the optimization implementation in PyTorch). If only one value is given, the same learning rate is set to all ladder steps.

lr_embedding

It specifies the initial learning rate to use during training of the embedding matrix for each ladder step. If only one value is given, the same learning rate is used for all ladder steps. If not set, the same value as for the lr argument is used. In some cases, using a learning rate for the embedding matrix that differs from (typically smaller than) the one used to train the ECI energy model helps prevent getting trapped in a local minimum.

By default, the embedding matrix is initialized by finding similarities between species based on physical/chemical properties of the pure species (see chemical_preconditioning()). It has been reported that this type of initialization is a fairly good starting point (see Constructing multicomponent cluster expansions with machine-learning and chemical embedding). Too large a learning rate might drive the embedding matrix out of the minimum in which it was initialized. For this reason, it is sometimes useful to use different learning rates for the embedding matrix and the ECI energy model.

weight_decay

It specifies the weight decays (i.e., the L2 regularizations) for each ladder step (see torch.optim for more information on the optimization implementation in PyTorch). If only one value is given, the same weight decay is used for all ladder steps. L2 regularization helps prevent overfitting by imposing a cost on the L2-norm of the weights. The algorithm optimizes the function \(\mathcal{L}\) defined as:

\[\mathcal{L}(\vec{w}; X) = \underbrace{\text{loss_fct}(\vec{w}; X)}_{\text{Loss function}} + \underbrace{\text{weight_decay} \times \vec{w}^T \vec{w}}_{\text{$L_2$-regularization}}\]
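As a worked example, the objective above can be evaluated by hand for a toy case. The sketch assumes a mean squared error as loss_fct and uses made-up predictions, targets, and model parameters; the \(\vec{w}\) here are the trainable parameters of the model, not the data weights stored in the database.

```python
def regularized_mse(predictions, targets, parameters, weight_decay=0.0):
    """Mean squared error plus weight_decay times the squared L2-norm of the parameters."""
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    l2_penalty = sum(w * w for w in parameters)
    return mse + weight_decay * l2_penalty


# Toy numbers: MSE = (0 + 4) / 2 = 2, squared norm = 9 + 16 = 25
objective = regularized_mse(
    predictions=[1.0, 2.0],
    targets=[1.0, 4.0],
    parameters=[3.0, 4.0],
    weight_decay=0.01,
)
```

Here the penalty adds 0.01 x 25 = 0.25 to the loss of 2, giving 2.25; larger parameters are penalized more strongly as weight_decay grows.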

optimizers

It indicates the optimizers to be used at each ladder step. The optimizers need to be chosen from the list of PyTorch optimizer algorithms. If only one optimizer is given, the same optimizer is used for each ladder step. Because every optimizer has its own set of arguments, the optimizer must be given in the form of a dictionary (or string dictionary in the CLI) whose first level contains a single key, the name of the algorithm to be used in torch.optim, and whose second level contains the settings specific to that algorithm.

As an example, an Adam optimizer with the amsgrad flag is employed where the default betas and eps are modified:

--optimizers '{"Adam": {"amsgrad": true, "betas": [0.8, 0.999], "eps": 1e-7}}'

Note

The --lr and --weight_decay parameters could also be directly set here as settings of the optimizer.
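The two-level structure of the string dictionary can be unpacked as sketched below. This assumes that the string is valid JSON (suggested by the lowercase true in the example); the `parse_component` helper is hypothetical and is not the parser used by the ece module.

```python
import json


def parse_component(spec):
    """Split a '{"Name": {settings}}' specification into (name, settings)."""
    parsed = json.loads(spec) if isinstance(spec, str) else dict(spec)
    if len(parsed) != 1:
        raise ValueError("the first level must contain exactly one key")
    (name, settings), = parsed.items()
    return name, dict(settings or {})


name, settings = parse_component(
    '{"Adam": {"amsgrad": true, "betas": [0.8, 0.999], "eps": 1e-7}}'
)
# With PyTorch installed, the class could then be looked up with
# getattr(torch.optim, name) and given the settings as keyword arguments.
```

The key is written exactly as the class is named in torch.optim (Adam), since an attribute lookup of this kind is case-sensitive.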

schedulers

It indicates the learning rate scheduler to be used at each ladder step. The learning rate scheduler is a tool to automatically change the value of the learning rate as training proceeds (i.e., as the number of performed epochs increases). It needs to be chosen from the list of PyTorch learning rate schedulers. If only one scheduler is given, the same scheduler is used for each ladder step. Because every scheduler has its own set of arguments, the scheduler must be given in the form of a dictionary (or string dictionary in the CLI) whose first level contains a single key, the name of the scheduler to be used in torch.optim.lr_scheduler, and whose second level contains the settings specific to that scheduler. The optimizer setting of the scheduler is automatically set from the --optimizers argument.

As an example, a multistep learning rate scheduler (MultiStepLR) is used where the learning rate is multiplied by 0.1 after 55, 85, and 95 epochs in the first ladder step, and then a step learning rate scheduler (StepLR) is used where the learning rate is multiplied by 0.2 every 30 epochs in the second ladder step:

--schedulers '{"MultiStepLR": {"milestones": [55, 85, 95], "gamma": 0.1}}' '{"StepLR": {"step_size": 30, "gamma": 0.2}}'
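The learning rates produced by these two schedulers can be worked out in closed form, as sketched below in plain Python. The functions reproduce the decay rules of MultiStepLR and StepLR without calling PyTorch; the initial learning rates of 0.01 and 0.001 are hypothetical values chosen for the illustration.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate after `epoch` epochs: decayed by gamma at each milestone reached."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


def step_lr(base_lr, step_size, gamma, epoch):
    """Learning rate after `epoch` epochs: decayed by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# First ladder step: MultiStepLR with milestones [55, 85, 95] and gamma 0.1
first_step = [multistep_lr(0.01, [55, 85, 95], 0.1, e) for e in (0, 55, 85, 95)]
# Second ladder step: StepLR with step_size 30 and gamma 0.2
second_step = [step_lr(0.001, 30, 0.2, e) for e in (0, 30, 60)]
```

Starting from 0.01, the first ladder step ends with a learning rate of about 1e-5, while the second one decays from 0.001 to 4e-5 after 60 epochs.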

earlystopping_patience

It specifies how many epochs to wait before stopping the training when the validation loss is not decreasing, i.e., it sets the patience setting of the EarlyStopping object, for each ladder step. If only one value is given, the same patience is used for each ladder step. The weights corresponding to the best validation loss are recorded so that, if the number of epochs without improvement reaches the patience, the model returns to the state with the best validation loss.
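The patience mechanism can be illustrated on a recorded sequence of validation losses. This is a generic sketch of early stopping, not the EarlyStopping object itself; the function name and the loss values are made up.

```python
def early_stopping_epoch(validation_losses, patience):
    """Return (best_epoch, best_loss, last_epoch_run) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            # New best state: this is where the model weights would be saved
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Patience exhausted: stop and fall back to the best state
                return best_epoch, best_loss, epoch
    return best_epoch, best_loss, len(validation_losses) - 1


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
best_epoch, best_loss, stopped_at = early_stopping_epoch(losses, patience=3)
```

With a patience of 3, training stops at epoch 5 (counting from 0) and the model is restored to its state at epoch 2, where the validation loss was lowest.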

loss_fct

It indicates the loss function to be used during training for all ladder steps. The available loss functions can be found in torch.nn. The algorithm optimizes the function \(\mathcal{L}\) defined as:

\[\mathcal{L}(\vec{w}; X) = \underbrace{\text{loss_fct}(\vec{w}; X)}_{\text{Loss function}} + \underbrace{\text{weight_decay} \times \vec{w}^T \vec{w}}_{\text{$L_2$-regularization}}\]

settings

A path to a YAML file containing all arguments can be given in place of or together with arguments given in the CLI. In case of conflict between arguments given in the settings file and in the CLI, the arguments in the settings file take priority. Missing arguments are set to the values given in the CLI or, failing that, to their default values.

Below is an example of such a settings file. Two test datasets are generated, the first test dataset is randomly generated, while the second contains the last 160 data.

# Datasets
data_file: database.pkl              # Path to the database with all data
train_prop: 0.6                      # Proportion of data contained in the training dataset (60%)
valid_prop: 0.1                      # Proportion of data contained in the validation dataset (10%)
test_prop: [0.3, 0]                  # Proportion of data contained in the test datasets (30% and 0%)
train_indexes: [':81']               # Indexes of data to be contained in the train dataset (include the first 81 data in the training set)
test_indexes: [[], ['-160:']]        # Indexes of data to be contained in the test datasets (include the last 160 data in the second test set)

# Fitting parameters
batch_size: 50                       # Batch size for every stochastic step
ladder_steps:                        # Number of first orbits to include at each ladder step
- 4
- 6
- 8
- 10
- 11
epochs:                              # Number of epochs to perform at each ladder step
- 100
- 40
- 40
- 40
- 40
lr:                                  # Learning rate to apply at each ladder step
- 0.01
- 0.001
- 0.001
- 0.001
- 0.001
lr_embedding:                        # Learning rate to apply to the embedding matrix at each ladder step
- 1e-3
- 5e-4
- 5e-4
- 5e-4
- 5e-4
weight_decay: 1e-6                   # Weight decay to apply at each ladder step
optimizers:                          # String dictionary of the optimizer to apply at each ladder step
- Adam:
   amsgrad: true
schedulers:                          # String dictionary of the scheduler to apply at each ladder step
- MultiStepLR:
   milestones: [55, 85, 95]
   gamma: 0.1
- MultiStepLR:
   milestones: [15, 35]
   gamma: 0.1
- MultiStepLR:
   milestones: [15, 35]
   gamma: 0.1
- MultiStepLR:
   milestones: [15, 35]
   gamma: 0.1
- MultiStepLR:
   milestones: [15, 35]
   gamma: 0.1
earlystopping_patience:              # Early-stopping patience to apply at each ladder step
- 50
- 40
- 40
- 40
- 40
loss_fct: MSELoss                    # Loss function to be used in the training process

# General settings
name: fitted_model.tar.gz            # Name of the trained model
log: true                            # Create a log file with the training process data
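The precedence between the settings file, the CLI, and the default values described for the settings argument can be sketched as successive dictionary updates. The `resolve_arguments` helper and the CLI values are hypothetical; the defaults are those listed in the table of arguments.

```python
# A few defaults taken from the table of arguments
DEFAULTS = {"model": "model.pth", "batch_size": 50, "loss_fct": "MSELoss", "name": "model.pth"}


def resolve_arguments(cli_args, settings_args, defaults=DEFAULTS):
    """Merge arguments with increasing priority: defaults, then CLI, then settings file."""
    resolved = dict(defaults)
    resolved.update(cli_args)        # CLI values override the defaults
    resolved.update(settings_args)   # settings-file values override the CLI
    return resolved


cli_args = {"batch_size": 20, "name": "cli_model.pth"}
settings_args = {"batch_size": 100, "log": True}
resolved = resolve_arguments(cli_args, settings_args)
```

In this example, batch_size comes from the settings file (100), name from the CLI, and loss_fct from the defaults.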

log

This argument indicates whether to save a log file with the indexes of the data split and the errors evaluated along the ladder training. The generated log file is saved in the same folder as the model (--name argument) with the log_fit.json basename.

name

It specifies the name of the model once it is trained. PyeCE eCE models are saved either in the default PyTorch format (typically with a .pt or .pth extension) or, if the ECI model is not built using the from_settings building attribute, as a gzip-compressed tarfile (with a .tar.gz extension). The tarfile contains 3 files:

  1. prim.json: a JSON file used to construct a Prim object.

  2. ece.pth: a PyTorch file to reload the ECI energy model with the correct architecture.

  3. eCE_state.pth: a PyTorch file to reload the weights of the eCE model.

verbose

Set it to print information during the training of the model.