Introduction

Reliably representing the inter-atomic potential energy surface (PES) is central to studying the properties of molecules and materials in computational physics, chemistry, materials science, biology, etc. While electronic structure methods typically give an accurate and transferable PES, they are prohibitively expensive to scale to systems of more than thousands of atoms. Empirical force fields, on the other hand, are much more efficient but are inherently limited in accuracy for many applications. By properly integrating machine learning (ML) methodologies with physical requirements such as extensiveness and symmetries, various methods have emerged to address the accuracy vs. efficiency dilemma in PES modeling1,2,3,4,5,6,7,8,9,10,11. Arguably, a new paradigm is forming: electronic structure methods are no longer used to generate the driving forces during molecular dynamics simulations, but rather to generate data for training their alternatives, ML-based PES models.

Despite the remarkable achievements of ML-based PES models12,13,14, challenges remain. For a domain expert who would like to apply such methodologies, a natural first question concerns the effort needed to obtain a reliable PES model: Are there ready-to-use PES models? If not, how much training data and time would be required? Can we take advantage of the ever-increasing publicly available training data?

To address these issues, there have been several efforts. On one hand, general-purpose models for various systems, such as silicon15, phosphorus16, water17, metals and alloys18,19,20,21,22, etc., have been developed and are directly applicable to relevant studies. However, the range of applicability of such models is typically limited to a small conformational or chemical space. For alloys, for example, the majority of general-purpose ML models are developed for systems with at most two element types. On the other hand, several efficient data generation protocols have been developed23,24,25,26, a representative of which is DP-GEN25,26, a concurrent learning procedure that iteratively explores the configuration space using models trained with existing data and then labels only those configurations with a high uncertainty level. Even with these protocols, the computational effort needed for complicated systems remains prohibitive. For example, to train a fairly general-purpose model for the AlMgCu alloy system, 100k density functional theory (DFT)27,28 calculations were ultimately performed, at a cost of ten million CPU core hours18.

With the accumulation of high-quality electronic structure data covering almost all the elements of the periodic table, it is becoming possible to systematically develop pretraining schemes, which have been widely adopted in areas like computer vision (CV)29,30 and natural language processing (NLP)31,32. In these schemes, one first trains a unified model on large-scale datasets and then finetunes it for downstream tasks, expecting that a good representation is learned in the first stage so that the amount of supervised data needed in the second stage is significantly reduced. Recently, the pretraining-finetuning idea has been applied to organic molecular systems for energy and force predictions33,34, and to tasks beyond representing the PES35,36,37. Unfortunately, most ML-based PES models are not yet ready for such schemes at scale in materials applications. Taking the two widely used versions of the Deep Potential model6,7 as examples, their ML parameters are element-type-dependent, which makes them highly inefficient when the training data contain many elements.

Constant efforts have been devoted to adapting the architectures of ML-based PES models to large datasets. Among them, a class of models named equivariant graph neural networks (GNNs)38, built upon convolutions over atomic graphs with node- and edge-equivariant representations, has shown promise for training on large datasets. SchNet5, PaiNN39, GemNet-OC40, DimeNet++41, PFP42, SCN43, SpinConv44, and Equiformer/EquiformerV245,46 have been trained on the OC20/OC2M47 dataset, which contains about 133M/2M data frames covering 56 elements. These models are benchmarked by the accuracy of energy, force, and stable-structure predictions. Very recently, it has been shown that introducing the attention architecture45 into a GNN model improves the performance on the OC20/OC2M dataset46. Chen and Ong48 proposed M3GNet, which was trained on a subset of the Materials Project49 that contains 187,687 configurations encompassing 89 elements, labeled at the generalized gradient approximation (GGA)50 or GGA+U level. Takamoto et al.42 introduced the PFP model, which was trained on a dataset composed of molecular and crystal configurations including approximately 9 × 10⁶ frames of 45 elements. Choudhary et al.51 developed the ALIGNN model and trained it on a subset of the JARVIS-DFT dataset52 composed of 307,113 data frames of 89 elements. The M3GNet, PFP, and ALIGNN models are proposed as “universal” potential models; however, their accuracies are not on par with PES models trained for specific materials applications.

Although equivariant GNN models are potential candidates for pretraining, several issues deserve special attention before applying them to downstream real-world applications. First, the GNN approaches are not well-suited for massively parallel molecular dynamics simulations53. The update of each GNN layer requires communication between spatially decomposed sub-regions of the system, and each evaluation of the energy and forces requires several to a dozen such updates in total, which may lead to a substantial communication overhead on massively parallel high-performance supercomputers. Second, some models, such as PaiNN, GemNet-OC, SCN, and Equiformer/EquiformerV2, directly predict forces using rotationally equivariant networks39,40,45,54 instead of taking energy gradients with respect to atomic coordinates. The predicted forces are therefore not conservative, whereas conservation is a basic assumption in guaranteeing the accuracy of molecular simulations55. The DimeNet++41 and Allegro53 models are conservative. Last but not least, some models, such as GemNet-OC, SpinConv, M3GNet, and ALIGNN, are not smooth, i.e., a sudden energy jump may happen as the positions of atoms vary infinitesimally. This leads to non-conserved energy in Hamiltonian dynamics simulations, which are used to compute dynamical properties such as the diffusion constant and viscosity.

So far, it has remained unclear how much downstream materials applications can benefit from ML models trained on large-scale datasets. To answer this question, in this article we propose DPA-1, a Deep Potential model with a gated attention mechanism. Designed with a local descriptor, this model is exceptionally well-suited for parallel simulations of large-scale systems containing millions of atoms56. Notably, DPA-1 predicts conservative forces, ensures smoothness, and demonstrates outstanding efficacy in learning inter-atomic interactions. Moreover, once pretrained, DPA-1 can significantly decrease the supplementary effort needed for downstream tasks. We tested DPA-1 on various systems and observed superior performance compared with existing benchmarks. We then took the AlMgCu alloy systems18 as an example, showing that after pretraining with single-element and binary samples, DPA-1 can save around 90% of the ternary samples compared with the DeepPot-SE model7. Finally, we pretrained DPA-1 on the OC20 dataset, which covers 56 elements, and successfully applied it to various downstream tasks. We checked the interpretability of the pretrained model by inspecting the learned embedding parameters for different element types, finding that the 56 elements are arranged on a spiral in the latent space, which has a natural correspondence with their physical properties in the periodic table. Above all, we believe that DPA-1 and the pretraining scheme will bring the field of molecular simulation to a new stage.

Results

We conducted a number of experiments to evaluate the performance of DPA-1, with its architecture illustrated in Fig. 1 and detailed in the Methods section. First, to test the model’s ability to transfer among different compositions, we trained it from scratch on various systems and tested it under several challenging schemes. Then, we used an AlMgCu dataset to test its ability to transfer to ternary systems after pretraining on single-element and binary data. Finally, we pretrained DPA-1 on the OC2M subset of the OC20 dataset47 and applied it to various downstream tasks. To illustrate the effectiveness of the type-embedding and attention schemes, we compared DPA-1 against the DeepPot-SE model7 in all experiments. In the following, we first introduce the datasets we used and then the experiments we conducted.

Fig. 1: Schematic illustration of DPA-1.
figure 1

a Flowchart from \({{{{\mathcal{A}}}}}^{i}\) and \({{{{\mathcal{R}}}}}^{i}\) to the atomic energy ei. b Structure of the Embedding net, which maps s(rji) and Ti, through multiple residual layers, to \({{{{\mathcal{G}}}}}^{i}\). c Self-attention mechanism on \({{{{\mathcal{G}}}}}^{i}\) through a standard scaled dot-product procedure gated by the angular information \({\hat{{{{\mathcal{R}}}}}}^{i}{({\hat{{{{\mathcal{R}}}}}}^{i})}^{T}\). d Fitting net structure, similar to the Embedding net, from the descriptor \({{{{\mathcal{D}}}}}^{i}\) and Ti to the final atomic energy ei.

Datasets

AlMgCu alloy systems18. This dataset is generated using DP-GEN26, a concurrent learning scheme. After exploring 2.73 billion alloy configurations (derived from ~2000 bulk and surface systems), only a small portion (~100k configurations) is labeled, composing the compact dataset. The exploration runs over the whole concentration space, i.e., AlxCuyMgz with 0 ≤ x, y, z ≤ 1, x + y + z = 1, where x, y, z take the discrete values permitted by the finite-size simulation boxes. We divide the systems into single-element, binary, and ternary subsets according to the number of non-zero values among x, y, and z. The configuration space covers a temperature range of around 50.0 K to 2579.8 K and a pressure range of around 1 bar to 50,000 bar.

Solid-state electrolyte (SSE) systems57. These systems contain Li10XP2S12-type SSE materials, where X represents a single element or a combination of Ge/Si/Sn, and can be divided into three main parts: init, mix, and single. The init part comes from the standard DP-GEN scheme starting from 590 structures that are generated by slightly perturbing the DFT-relaxed crystal structures Li10SiP2S12 and Li10SnP2S12 from the Materials Project49. The exploration covers both ordered structures relaxed by DFT (i.e., structures downloaded from the Materials Project database, in which the positions of the Ge/Si/Sn/P atoms are fixed) and disordered structures whose 4d sites are randomly occupied by Ge/Si/Sn/P. Based on the init part, the mix part further explores binary and ternary mixtures of Ge/Si/Sn, while the single part covers only a single X in Ge/Si/Sn with additional variations in the lattice and the Li ratio.

HEA systems. The high-entropy alloy (HEA) dataset includes bulk TaNbWMoVAl alloy systems of various configurations and compositions. We employ DP-GEN to explore the composition space, starting from Ta3Nb3W3Mo3V3Al1, a 16-atom unit cell containing the first five elements as main components and Al as an additive. The dataset is divided into two subsets: interior and exterior. The interior (higher-entropy) subset includes composition variations near the starting point, covering six-component, quinary, quaternary, and ternary alloys. The exterior (lower-entropy) subset includes systems close to the corners and edges of the composition space: systems in which one or two elements dominate, binary alloys, and single-element systems. For both subsets, the temperature range is around 50.0 K to 388.1 K and the pressure range is around 1 bar to 50,000 bar.

OC2047. OC20 consists of single adsorbates (small molecules) physically binding to the surfaces of catalysts, which are periodic bulk materials covering 56 elements. Both the chemical diversity and the system sizes are much more complex than those of other benchmark datasets, such as MD1758, ANI-1x24, or QM959. OC2M is a subset containing 2 million data points (energies and forces) randomly sampled from OC20, which is still challenging for model training yet suitable for pretraining. Gasteiger et al. recently provided several baselines on OC2M, taking months to converge40.

Accuracy on various datasets, trained from scratch

Most existing models focus on the ability to transfer among different configurations, in which case the training and validation subsets consist of similar compositions (e.g., randomly sampled from the same dataset). However, under a pretraining scheme, the upstream and downstream datasets may differ substantially. It is thus vital for models in this setting to transfer among different compositions or even among different datasets, which has, as far as we know, rarely been discussed before. In this work, we mainly focus on this more general but challenging scheme to comprehensively test the generalization ability of the model.

We first designed several challenging tasks to test the model’s ability to transfer among different compositions. For the AlMgCu, SSE, and HEA systems, we divided the data into subsets with different compositions for training and validation (see the Datasets subsection for details). The results for DPA-1 and DeepPot-SE are shown in Table 1. With nearly identical training losses (omitted in the table), DPA-1 drastically outperforms DeepPot-SE in validation accuracy. For example, for the AlMgCu systems, when trained only on single-element and binary samples, the validation RMSE of DPA-1 on ternary samples outperforms that of DeepPot-SE by one order of magnitude (6.99 versus 65.1 meV/atom). This suggests that the DPA-1 model may have learned the latent ternary Al-Mg-Cu interactions from the binary pairs Al-Mg, Al-Cu, and Mg-Cu and from single-element interactions, possibly thanks to the type-embedding scheme and the attention mechanism. We conducted an ablation study on the HEA systems in Supplementary Note 1 to demonstrate the influence of each architectural component.

Table 1 Validation RMSE of DPA-1 and DeepPot-SE on energy (ΔE, meV/atom) and atomic forces (ΔF, meV/Å) with different settings of the training/validation sets (see the Datasets section for details)

To test the performance of DPA-1 in predicting more physical quantities, we performed geometry relaxations of all AlMgCu ternary alloys available in the Materials Project to evaluate the accuracy of the predicted formation energies and equilibrium volumes (see details in Supplementary Note 2). We also used the model to calculate the elastic moduli of the AlMgCu systems, which requires accurately capturing second-order information (see details in Supplementary Note 3). Additionally, we carried out molecular dynamics simulations of the LiGePS systems to assess the diffusion coefficients as a function of temperature, comparing the results with ab initio molecular dynamics (AIMD) simulations and experimental studies (see details in Supplementary Note 4). In all tests, satisfactory agreement with the DFT and/or experimental references is obtained.

As a supplement, we also trained the DPA-1 model on several simple systems to compare with other ML-based PES models. Since these tasks are much easier than the above ones and outside our main focus, we place the results in Supplementary Note 8. Note that there may be relatively little room for improvement on these simple datasets, yet DPA-1 still outperforms other methods with even fewer training samples.

Sample efficiency of pretrained models

As shown in Fig. 2, we use learning curves to illustrate the amount of additional training data saved for downstream tasks thanks to model pretraining. In all the experiments, the learning curves were generated by an active learning procedure: a pool of data labeled with energies and forces is prepared, and three steps are repeated iteratively: training the model on the samples in the training pool; testing the model on the remaining samples; and selecting the 50 samples with the largest prediction errors on per-atom energies and adding them to the training pool. We use the term sample efficiency to denote the amount of training samples required by a model to achieve a given accuracy level on a certain task. The hyperparameter settings for these tests can be found in Supplementary Note 9.
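The following minimal Python sketch summarizes this loop; the `model` object with `train_on` and `predict_energy_per_atom` methods is a hypothetical stand-in for the actual training and evaluation tooling, and the initial random pool is an illustrative choice.

```python
import numpy as np

def learning_curve(model, pool_X, pool_y_per_atom, n_init=50, n_add=50, n_rounds=20):
    """Active-learning loop used to generate the learning curves (schematic)."""
    rng = np.random.default_rng(0)
    train_idx = list(rng.choice(len(pool_X), size=n_init, replace=False))
    remaining = [i for i in range(len(pool_X)) if i not in set(train_idx)]

    curve = []
    for _ in range(n_rounds):
        # 1. train on the current training pool
        model.train_on([pool_X[i] for i in train_idx],
                       [pool_y_per_atom[i] for i in train_idx])
        # 2. test on the remaining samples (per-atom energy errors)
        errors = np.array([abs(model.predict_energy_per_atom(pool_X[i]) - pool_y_per_atom[i])
                           for i in remaining])
        curve.append(errors.mean())
        # 3. move the 50 worst-predicted samples into the training pool
        worst = set(np.argsort(errors)[-n_add:].tolist())
        train_idx += [remaining[k] for k in worst]
        remaining = [s for k, s in enumerate(remaining) if k not in worst]
    return curve
```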

Fig. 2: Learning curves of both energy and force with DeepPot-SE and DPA-1, under different setups and on different systems.
figure 2

a Learning curves on the AlMgCu ternary subset, with DeepPot-SE and DPA-1 models, pretrained on single-element and binary subsets; Learning curves on HEA (b) and AlCu (c), with DeepPot-SE (from scratch) and DPA-1 (both from scratch and pretrained on OC2M). The red line represents the full data training baseline with DPA-1.

We started with a relatively simple task to compare DeepPot-SE and DPA-1. In this task, both models were pretrained on the single-element and binary subsets of the AlMgCu systems, and the learning curves were obtained on the AlMgCu ternary subset. As shown in Fig. 2a, DPA-1 exhibits a much better sample efficiency than DeepPot-SE, as expected.

Next, we used the OC2M dataset, which contains 56 elements, to pretrain DPA-1 and evaluated its performance on the HEA and AlCu systems (Fig. 2b, c, respectively). As shown in Fig. 3c, the training cost of DeepPot-SE scales quadratically with the number of elements, making its pretraining computationally infeasible, while the number of elements has no effect on the training cost of DPA-1. We observe that the sample efficiency of DPA-1 pretrained on OC2M is generally better than that of DPA-1 trained from scratch, while DeepPot-SE trained from scratch is the worst. Moreover, compared with the AlCu systems, the improvement from pretraining is much more significant for the HEA systems, possibly because the number of elements in HEA is much larger than in AlCu and the local chemical environments are much more complicated.
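The scaling argument can be illustrated with a schematic count of distinct embedding networks, assuming DeepPot-SE keeps one embedding network per (center type, neighbor type) pair while DPA-1 shares a single network that reads type embeddings; this is a simplified sketch, not the actual implementation.

```python
def num_embedding_nets(n_types: int, model: str) -> int:
    """Rough count of distinct embedding networks that have to be trained."""
    if model == "DeepPot-SE":
        # one embedding network per (center type, neighbor type) pair: quadratic growth
        return n_types * n_types
    if model == "DPA-1":
        # a single shared network; element identity enters through the type embeddings
        return 1
    raise ValueError(model)

for n in (2, 6, 56):
    print(n, num_embedding_nets(n, "DeepPot-SE"), num_embedding_nets(n, "DPA-1"))
# 2 -> 4 vs 1,  6 -> 36 vs 1,  56 -> 3136 vs 1
```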

Fig. 3: Interpretability of DPA-1 pretrained on OC2M and training efficiency comparison with DeepPot-SE.
figure 3

a Three-dimensional PCA visualization of the learned type embeddings of DPA-1 pretrained on OC2M. The 56 elements are roughly arranged on a spiral in the latent space. Elements in the fourth period are connected with the red line, and elements belonging to the same family are grouped by the blue dotted lines. The colors of the element names represent the height along the z-axis. We use a dashed circle to denote the hypothetical position of Li, which is not contained in OC2M; see the text for discussion. b RMSE of energy and force for the SSE systems given by DPA-1 pretrained on OC2M, as functions of the linear interpolation coefficient \(\lambda \left(Na\right)\). Since Li is not contained in OC2M, we let \({T}_{Li}=\lambda \left(Na\right)* {T}_{Na}+\left(1-\lambda \left(Na\right)\right)* {T}_{H}\) be the interpolated type embedding of Li. The OC2M-pretrained model with this interpolation and a modified energy bias is directly tested on the SSE systems without further training. c Training efficiency of DPA-1 and DeepPot-SE (considering the type information of both the center and neighbor atoms) as the number of element types in the training systems grows. The maximum number of neighboring atoms considered is set to 120 in all experiments.

Equivariant GNN models usually need thousands of GPU hours to be trained to a decent accuracy40. By contrast, the DPA-1 model takes less than 200 GPU hours to train. The converged energy and force MAEs on the OC2M validation set are 0.681 eV and 0.076 eV/Å, respectively. This accuracy is comparable with the best energy-conserving GNN model, DimeNet++, which achieves MAEs of 0.805 eV and 0.066 eV/Å, as reported in ref. 40. A better performance, with an energy MAE of 0.286 eV and a force MAE of 0.026 eV/Å, is achieved by GemNet-OC at the cost of non-conservative forces and loss of smoothness40.

In a potential energy model, the presence of non-conservative forces and unsmoothness introduces an artificial energy drift in MD simulations. When investigating static properties, this drift can be removed by incorporating a thermostat in the simulation, although its potential impact on the accuracy of property estimation must still be examined carefully. To calculate dynamical responses of the system, such as the self-diffusion coefficient, viscosity, and heat conductivity, it is typically necessary to evaluate auto-correlation functions using the Green-Kubo relations60,61. The estimation of auto-correlation functions usually requires 10–100 ps long micro-canonical (NVE) simulations to achieve converged statistics and eliminate possible nonergodicity in the Hamiltonian dynamics62. In this context, energy conservation is critical; otherwise, the energy drift may drive the system to an undesired thermodynamic state or even cause a blow-up of the total energy. In Supplementary Note 5, we demonstrate the magnitude of the total energy drift during a 100-ps long NVE MD simulation for OC20 configurations. The drift observed in non-conservative models is approximately 10⁻² eV/atom, which corresponds to a temperature of roughly 10² K. In contrast, the energy-conserving DPA-1 model, as anticipated, does not exhibit any energy drift.
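As a concrete illustration of why long, drift-free NVE trajectories are needed, the following minimal numpy sketch estimates the self-diffusion coefficient from the velocity autocorrelation function via the Green-Kubo relation; unit conversions are left to the caller, and this helper is not part of the DPA-1 code.

```python
import numpy as np

def self_diffusion_green_kubo(velocities, dt, t_max):
    """Green-Kubo estimate D = (1/3) * integral_0^t_max <v(0) . v(t)> dt.

    velocities: (n_frames, n_atoms, 3) array from a long NVE trajectory
    dt: time between stored frames; t_max: upper limit of the correlation time
    """
    n_frames, n_atoms, _ = velocities.shape
    n_corr = int(t_max / dt)
    vacf = np.empty(n_corr)
    for lag in range(n_corr):
        # velocity autocorrelation averaged over time origins and atoms
        dots = np.einsum("tad,tad->", velocities[: n_frames - lag], velocities[lag:])
        vacf[lag] = dots / ((n_frames - lag) * n_atoms)
    # trapezoidal integration of the VACF
    integral = dt * (vacf.sum() - 0.5 * (vacf[0] + vacf[-1]))
    return integral / 3.0
```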

As shown in Supplementary Note 6, we observed that, when trained for 1 million steps on the AlMgCu alloy dataset, the non-conservative models achieve relatively higher force accuracy but lower energy accuracy compared with the conservative models. Furthermore, the accuracy of the non-conservative models in predicting the equation of state (EOS), a fundamental material property, is lower than that of the conservative models. This may be attributed to the fact that non-conservative models predict energy and force separately, so accurate force prediction does not necessarily improve the shape of the energy landscape.

Interpretability of type embedding learned from pretraining

To see whether DPA-1 can learn physically meaningful information from pretraining, we investigated the three-dimensional principal component analysis (PCA) visualization of the learned type embeddings in the OC2M-pretrained model. Interestingly, as shown in Fig. 3a, the arrangement of the elements generally follows the shape of a downward spiral. Elements belonging to the same period line up along the direction of the spiral, while elements belonging to the same family are listed in the direction orthogonal to the spiral. Even though some transition-metal elements are clustered closely together, this rule still roughly holds. We observe that C, N, and O are outliers, possibly because in OC2M they mostly appear in organic molecules, which serve as adsorbates and have chemical environments very different from those of the other elements.

In addition, we performed interpolation experiments for the type embedding of Li, an element unseen in OC2M. As shown in Fig. 3b, we let \({T}_{Li}=\lambda \left(Na\right)* {T}_{Na}+\left(1-\lambda \left(Na\right)\right)* {T}_{H}\), since Li lies between H and Na in the same family. When testing on the SSE systems, only the bias in the atomic energy is changed, since the setup of the electronic structure method used to label the SSE systems differs from that of OC2M, which typically causes an energy shift. The RMSE of both energy and force shows a sudden drop when \(\lambda \left(Na\right)=0.7\), which matches chemical intuition and further confirms the interpretability of the pretrained DPA-1 model. Moreover, we conducted analogous interpolation experiments for Nb and Mo on the HEA systems and reached conclusions similar to those of the Li interpolation (see the detailed report in Supplementary Note 7).

Discussion

In this paper, we developed DPA-1, an attention-based Deep Potential model that allows for large-scale pretraining on atomistic datasets. We tested DPA-1 from different aspects, showing its excellent accuracy on various datasets when trained from scratch, as well as its sample efficiency when pretrained with existing data. Further investigation of the type-embedding parameters suggests the interpretability of DPA-1 pretrained on OC2M.

In the future, it will be of interest to extend the training dataset to cover the full periodic table and, in particular, to see a more converged “spiral” in the latent space; the embedding information of local chemical environments may also be useful to characterize different conformations. Multi-task and unsupervised training schemes are worth exploring; and, for downstream tasks, just as has happened in the fields of CV and NLP, schemes like model compression, distillation, and transfer learning are urgently needed. We leave these possibilities and more applications to future work.

Methods

Consider a system of N atoms whose elemental types are \({{{\mathcal{A}}}}=\left\{{\alpha }_{1},{\alpha }_{2},...,{\alpha }_{i},...,{\alpha }_{N}\right\}\) and whose atomic coordinates are \({{{\mathcal{R}}}}=\left\{{{{{\boldsymbol{r}}}}}_{1},{{{{\boldsymbol{r}}}}}_{2},...,{{{{\boldsymbol{r}}}}}_{i},...,{{{{\boldsymbol{r}}}}}_{N}\right\}\), with ri being the three Cartesian coordinates of atom i. The PES of the system is denoted by E, a function of the elemental types and coordinates, i.e., \(E=E({{{\mathcal{A}}}},{{{\mathcal{R}}}})\). For each atom i, consider its neighbors \(\{j| j\in {{{{\mathcal{N}}}}}\!_{{r}\!_{c}}(i)\}\), where \({{{{\mathcal{N}}}}}\!_{{r}\!_{c}}(i)\) denotes the set of atom indices j such that rji < rc, with rji being the Euclidean distance between atoms i and j. E is represented as the summation of atomic energies \(\left\{{e}_{1},{e}_{2},...,{e}_{i},...,{e}_{N}\right\}\), where the atomic energy ei depends only on the information of \({{{{\mathcal{N}}}}}\!_{{r}\!_{c}}(i)\). We define \({N}_{i}=| {{{{\mathcal{N}}}}}\!_{{r}\!_{c}}(i)|\), the cardinality of the set \({{{{\mathcal{N}}}}}\!_{{r}\!_{c}}(i)\). We use \({{{{\mathcal{A}}}}}^{i}\) to denote the element types in \({{{{\mathcal{N}}}}}\!_{{r}\!_{c}}(i)\), and \({{{{\mathcal{R}}}}}^{i}\in {{\mathbb{R}}}^{{N}_{i}\times 3}\) their corresponding coordinates relative to atom i. The atomic energy ei is thus a function of \({{{{\mathcal{A}}}}}^{i}\) and \({{{{\mathcal{R}}}}}^{i}\). The atomic force on atom i, \({{{{\mathcal{F}}}}}_{i}\), is defined as the negative gradient of the total energy with respect to the coordinate of atom i:

$${{{{\mathcal{F}}}}}_{i}=-{\nabla }_{{{{{\boldsymbol{r}}}}}_{{{{\boldsymbol{i}}}}}}E.$$
(1)

We refer to ref. 7 for a detailed discussion of several requirements for PES modeling. In particular, the PES has to be invariant under translation, rotation, and permutation of the indices of atoms with the same element types.
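As a minimal illustration of this setup, the sketch below builds the neighborhoods within rc, sums hypothetical atomic energies, and obtains forces via automatic differentiation according to Eq. (1); `atomic_energy_fn` is a placeholder for the DPA-1 atomic model described below, and the loop over atoms is for clarity rather than efficiency.

```python
import torch

def total_energy_and_forces(coords, types, atomic_energy_fn, r_cut=6.0):
    """coords: (N, 3) float tensor; types: (N,) long tensor of element types.

    The total energy is the sum of atomic energies e_i, each depending only on
    the neighborhood N_rc(i); forces follow Eq. (1) via automatic differentiation.
    """
    coords = coords.clone().requires_grad_(True)
    n = coords.shape[0]
    dists = torch.cdist(coords, coords)                    # pairwise distances r_ji
    energies = []
    for i in range(n):
        mask = (dists[i] < r_cut) & (torch.arange(n, device=coords.device) != i)
        neighbors = torch.nonzero(mask).squeeze(-1)
        rel = coords[neighbors] - coords[i]                # R^i: relative coordinates
        energies.append(atomic_energy_fn(rel, types[i], types[neighbors]))
    energy = torch.stack(energies).sum()                   # E = sum_i e_i
    forces = -torch.autograd.grad(energy, coords)[0]       # F_i = -dE/dr_i, Eq. (1)
    return energy, forces
```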

The details of the model architecture are introduced below. We refer to Fig. 1 for the overall pipeline to predict the atomic energy ei: from the embedded neighboring environment, through the self-attention scheme, to the symmetry-preserving descriptors, and finally to the fitting network.

Local embedding matrix with type information

We obtain the local embedding matrix with the following three steps. First, \({{{{\mathcal{R}}}}}^{i}\) is mapped to the generalized coordinates \({\tilde{{{{\mathcal{R}}}}}}^{i}\in {{\mathbb{R}}}^{{N}_{i}\times 4}\). In this mapping, each row of \({{{{\mathcal{R}}}}}^{i},\{{x}_{ji},{y}_{ji},{z}_{ji}\}\), is transformed into a row of \({\tilde{{{{\mathcal{R}}}}}}^{i}\):

$$\{{x}_{ji},{y}_{ji},{z}_{ji}\}\mapsto \{s({r}_{ji}),{\hat{x}}_{ji},{\hat{y}}_{ji},{\hat{z}}_{ji}\},$$
(2)

where \(\{{x}_{ji},{y}_{ji},{z}_{ji}\}\) denotes the Cartesian coordinates of rji = rj − ri, \({\hat{x}}_{ji}=\frac{s({r}_{ji}){x}_{ji}}{{r}_{ji}},{\hat{y}}_{ji}=\frac{s({r}_{ji}){y}_{ji}}{{r}_{ji}},{\hat{z}}_{ji}=\frac{s({r}_{ji}){z}_{ji}}{{r}_{ji}}\), and \(s({r}_{ji}):{\mathbb{R}}\,\mapsto\, {\mathbb{R}}\) is a continuous and differentiable scalar weighting function applied to each component, defined as:

$$s({r}_{ji})=\left\{\begin{array}{ll}\frac{1}{{r}_{ji}} &{r}_{ji}\, <\, {r}_{cs}\\ \frac{1}{{r}_{ji}}\left[{u}^{3}\left(-6{u}^{2}+15u-10\right)+1\right] &{r}_{cs}\le {r}_{ji}\, <\, {r}_{c},\quad u=\frac{{r}_{ji}-{r}_{cs}}{{r}_{c}-{r}_{cs}}.\\ 0 &{r}_{c}\,\le\, {r}_{ji}\end{array}\right.$$
(3)

Here rcs is a smooth cutoff parameter that allows the components in \({\tilde{{{{\mathcal{R}}}}}}^{i}\) to smoothly go to zero at the boundary of the local region defined by rc.
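A direct transcription of Eqs. (2) and (3) reads as follows (a minimal numpy sketch; the values of rcs and rc are left to the user):

```python
import numpy as np

def switch(r, r_cs, r_c):
    """Smooth weight s(r) of Eq. (3): 1/r below r_cs, polynomial decay to 0 at r_c."""
    if r < r_cs:
        return 1.0 / r
    if r < r_c:
        u = (r - r_cs) / (r_c - r_cs)
        return (1.0 / r) * (u ** 3 * (-6.0 * u ** 2 + 15.0 * u - 10.0) + 1.0)
    return 0.0

def generalized_coords(rel_coords, r_cs, r_c):
    """Eq. (2): map each row (x_ji, y_ji, z_ji) of R^i to
    (s(r_ji), s(r_ji)*x_ji/r_ji, s(r_ji)*y_ji/r_ji, s(r_ji)*z_ji/r_ji)."""
    rows = []
    for x, y, z in np.asarray(rel_coords, dtype=float):
        r = np.sqrt(x * x + y * y + z * z)
        s = switch(r, r_cs, r_c)
        rows.append([s, s * x / r, s * y / r, s * z / r])
    return np.array(rows)            # R~^i, shape (N_i, 4)
```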

Second, we add the atomic type embedding as supplemental information. For atom i, the type embedding map Ti is defined as:

$${T}_{i}={\phi }_{T}({\alpha }_{i}),$$
(4)

where αi is the atomic type of atom i and ϕT is a one-hot-like embedding network mapping αi to a fixed-length vector.

Then, given both \({\tilde{{{{\mathcal{R}}}}}}^{i}\) and type embeddings \(\{{T}_{i}\}\cup \{{T}_{j}| j\in {{{{\mathcal{N}}}}}_{{r}_{c}}(i)\}\), we define the local embedding matrix \({{{{\mathcal{G}}}}}^{i}\in {{\mathbb{R}}}^{{N}_{i}\times {M}_{1}}\):

$${\left({{{{\mathcal{G}}}}}^{i}\right)}_{j}=G(s({r}_{ji}),{T}_{i},{T}_{j}),$$
(5)

where G is a neural network mapping the scalar weight \(s({r}_{ji})\) and the type embeddings of both the center and neighbor atoms, through multiple hidden layers, to M1 outputs. Here we simply feed the concatenated inputs into G at once, as shown in Fig. 1b.
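The type embedding of Eq. (4) and the local embedding matrix of Eq. (5) can be sketched as follows (PyTorch; the layer widths and the plain feed-forward structure are illustrative simplifications of the residual embedding net in Fig. 1b):

```python
import torch
import torch.nn as nn

class LocalEmbedding(nn.Module):
    """G^i of Eq. (5): map (s(r_ji), T_i, T_j) to an M1-dimensional row per neighbor."""

    def __init__(self, n_types, type_dim=8, m1=32, hidden=64):
        super().__init__()
        self.type_embed = nn.Embedding(n_types, type_dim)       # phi_T of Eq. (4)
        self.net = nn.Sequential(                                # plain MLP stand-in
            nn.Linear(1 + 2 * type_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, m1),
        )

    def forward(self, s_rji, center_type, neighbor_types):
        """s_rji: (N_i,) weights; center_type: scalar long tensor;
        neighbor_types: (N_i,) long tensor."""
        t_i = self.type_embed(center_type).expand(neighbor_types.shape[0], -1)
        t_j = self.type_embed(neighbor_types)
        x = torch.cat([s_rji.unsqueeze(-1), t_i, t_j], dim=-1)  # concatenated inputs
        return self.net(x)                                       # G^i, shape (N_i, M1)
```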

Attention method for building up trainable descriptors

The attention mechanism has achieved great success and plays an increasingly important role in CV63 and NLP64. It has become an excellent tool for modeling the importance or relevance of visual regions or text tokens, and is thus potentially appropriate for reweighting the interactions among neighboring atoms according to both distance and angular information.

In DPA-1, we follow the standard self-attention mechanism and obtain the queries \({{{{\mathcal{Q}}}}}^{i,l}\in {{\mathbb{R}}}^{{N}_{i}\times {d}_{k}}\), keys \({{{{\mathcal{K}}}}}^{i,l}\in {{\mathbb{R}}}^{{N}_{i}\times {d}_{k}}\), and values \({{{{\mathcal{V}}}}}^{i,l}\in {{\mathbb{R}}}^{{N}_{i}\times {d}_{v}}\):

$$\begin{array}{r}{\left({{{{\mathcal{Q}}}}}^{i,l}\right)}_{j}={Q}_{l}\left({\left({{{{\mathcal{G}}}}}^{i,l-1}\right)}_{j}\right),\\ {\left({{{{\mathcal{K}}}}}^{i,l}\right)}_{j}={K}_{l}\left({\left({{{{\mathcal{G}}}}}^{i,l-1}\right)}_{j}\right),\\ {\left({{{{\mathcal{V}}}}}^{i,l}\right)}_{j}={V}_{l}\left({\left({{{{\mathcal{G}}}}}^{i,l-1}\right)}_{j}\right),\end{array}$$
(6)

where Ql, Kl, and Vl represent three linear transformations which output the queries and keys of dimension dk and the values of dimension dv, and l is the index of the attention layer. Here we take \({{{{\mathcal{G}}}}}^{i,0}={{{{\mathcal{G}}}}}^{i}\).

Then we adopt the scaled dot-product attention method65 to mix the neighbor features after calculating the attention weights:

$$A({{{{\mathcal{Q}}}}}^{i,l},{{{{\mathcal{K}}}}}^{i,l},{{{{\mathcal{V}}}}}^{i,l},{{{{\mathcal{R}}}}}^{i,l})=\varphi \left({{{{\mathcal{Q}}}}}^{i,l},{{{{\mathcal{K}}}}}^{i,l},{{{{\mathcal{R}}}}}^{i,l}\right){{{{\mathcal{V}}}}}^{i,l},$$
(7)

where \(\varphi \left({{{{\mathcal{Q}}}}}^{i,l},{{{{\mathcal{K}}}}}^{i,l},{{{{\mathcal{R}}}}}^{i,l}\right)\in {{\mathbb{R}}}^{{N}_{i}\times {N}_{i}}\) is the matrix of attention weights. In the original attention method, one typically has \(\varphi \left({{{{\mathcal{Q}}}}}^{i,l},{{{{\mathcal{K}}}}}^{i,l}\right)={{\mathrm{softmax}}}\,\left(\frac{{{{{\mathcal{Q}}}}}^{i,l}{({{{{\mathcal{K}}}}}^{i,l})}^{T}}{\sqrt{{d}_{k}}}\right)\), with \(\sqrt{{d}_{k}}\) being the normalization temperature. We modify this slightly to better incorporate the angular information:

$$\varphi \left({{{{\mathcal{Q}}}}}^{i,l},{{{{\mathcal{K}}}}}^{i,l},{{{{\mathcal{R}}}}}^{i,l}\right)={{\mathrm{softmax}}}\,\left(\frac{{{{{\mathcal{Q}}}}}^{i,l}{({{{{\mathcal{K}}}}}^{i,l})}^{T}}{\sqrt{{d}_{k}}}\right)\odot {\hat{{{{\mathcal{R}}}}}}^{i}{({\hat{{{{\mathcal{R}}}}}}^{i})}^{T},$$
(8)

where \({\hat{{{{\mathcal{R}}}}}}^{i}=\frac{{{{{\mathcal{R}}}}}^{i}}{\parallel {{{{\mathcal{R}}}}}^{i}{\parallel }_{2}}\in {{\mathbb{R}}}^{{N}_{i}\times 3}\) denotes the normalized relative coordinates and \(\odot\) denotes element-wise multiplication. Intuitively, in the neighborhood of the center atom i, a neighbor atom k may be highly correlated with atom j when both the relative-distance attention \({({{{{\mathcal{Q}}}}}^{i,l})}_{j}{({{{{\mathcal{K}}}}}^{i,l})}_{k}^{T}\) and the normalized product of relative coordinates \(\frac{{{{{\bf{r}}}}}_{ji}{({{{{\bf{r}}}}}_{ki})}^{T}}{{r}_{ji}{r}_{ki}}\) have high scores.

Then we add layer normalization in a residual way to finally obtain the self-attentioned local embedding matrix \({\hat{{{{\mathcal{G}}}}}}^{i}\) in one such attention layer:

$${{{{\mathcal{G}}}}}^{i,l}={{{{\mathcal{G}}}}}^{i,l-1}+{{{\rm{LayerNorm}}}}(A({{{{\mathcal{Q}}}}}^{i,l},{{{{\mathcal{K}}}}}^{i,l},{{{{\mathcal{V}}}}}^{i,l},{{{{\mathcal{R}}}}}^{i,l})).$$
(9)
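A minimal PyTorch sketch of one gated attention layer, Eqs. (6)–(9), is given below; the linear maps `W_q`, `W_k`, `W_v` and the `layer_norm` module are illustrative stand-ins (e.g., `nn.Linear` and `nn.LayerNorm`).

```python
import torch

def attention_layer(g, r_hat, W_q, W_k, W_v, layer_norm):
    """One gated self-attention layer, Eqs. (6)-(9).

    g:     (N_i, M1) local embedding matrix G^{i,l-1}
    r_hat: (N_i, 3)  row-normalized relative coordinates
    Assumes d_v == M1 so that the residual addition in Eq. (9) is well defined.
    """
    q, k, v = W_q(g), W_k(g), W_v(g)                                    # Eq. (6)
    d_k = q.shape[-1]
    scores = torch.softmax(q @ k.transpose(-1, -2) / d_k ** 0.5, dim=-1)
    gate = r_hat @ r_hat.transpose(-1, -2)                              # angular gate R^(R^)^T
    weights = scores * gate                                             # Eq. (8)
    attended = weights @ v                                              # Eq. (7)
    return g + layer_norm(attended)                                     # Eq. (9)
```

Stacking this layer l times yields the self-attentioned embedding matrix used below.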

We also tried other attention-related tricks, such as pre-layer normalization and multi-head attention, which brought little improvement. In practice, as shown in Fig. 1c, we repeat this procedure l (l ≥ 2) times for a more complete representation. Unless stated otherwise, we use l = 2 in the remainder of this work. Next, we define the encoded feature matrix \({{{{\mathcal{D}}}}}^{i}\in {{\mathbb{R}}}^{{M}_{1}\times {M}_{2}}\) of atom i:

$${{{{\mathcal{D}}}}}^{i}={({\hat{{{{\mathcal{G}}}}}}^{i})}^{T}{\tilde{{{{\mathcal{R}}}}}}^{i}{({\tilde{{{{\mathcal{R}}}}}}^{i})}^{T}{\dot{{{{\mathcal{G}}}}}}^{i},$$
(10)

where \({\dot{{{{\mathcal{G}}}}}}^{i}\) stands for a sub-matrix of \({\hat{{{{\mathcal{G}}}}}}^{i}\) consisting of its first M2 (< M1) columns. The feature matrix \({{{{\mathcal{D}}}}}^{i}\), i.e., the descriptor, preserves all the invariances mentioned above; the proof can be found in ref. 7. We then pass the reshaped \({{{{\mathcal{D}}}}}^{i}\), concatenated with the type embedding of the center atom, through the multi-layer fitting network:

$${e}_{i}=e\left({{{{\mathcal{D}}}}}^{i},{T}_{i}\right).$$
(11)

The total energy of the system is then given as the summation of ei, and the atomic force \({{{{\mathcal{F}}}}}_{i}\) can be further computed via Eq. (1).
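The descriptor of Eq. (10) and the atomic energy of Eq. (11) can then be assembled as in this PyTorch sketch (the `fitting_net` MLP and the value of M2 are illustrative):

```python
import torch

def atomic_energy(g_hat, r_tilde, t_i, fitting_net, m2=16):
    """Descriptor D^i of Eq. (10) and atomic energy e_i of Eq. (11).

    g_hat:   (N_i, M1) self-attentioned local embedding matrix
    r_tilde: (N_i, 4)  generalized coordinates
    t_i:     (type_dim,) type embedding of the center atom
    fitting_net expects a vector of length M1 * m2 + type_dim.
    """
    g_dot = g_hat[:, :m2]                              # first M2 columns of G^
    d = g_hat.T @ r_tilde @ r_tilde.T @ g_dot          # (M1, M2) descriptor, Eq. (10)
    features = torch.cat([d.reshape(-1), t_i])         # reshaped D^i concatenated with T_i
    return fitting_net(features)                       # e_i, Eq. (11)
```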

Model (pre-)training and finetuning

For model training or pretraining, we adopted the Adam stochastic gradient descent method66 on all the trainable parameters w inside the model to minimize the loss:

$${{{{\mathcal{L}}}}}_{{{{\boldsymbol{w}}}}}({E}^{{{{\boldsymbol{w}}}}},{{{{\mathcal{F}}}}}^{{{{\boldsymbol{w}}}}})=\frac{1}{| {{{\mathcal{B}}}}| }\mathop{\sum}\limits_{t\in {{{\mathcal{B}}}}}\left({p}_{\epsilon }{\left\vert {E}_{t}-{E}_{t}^{{{{\boldsymbol{w}}}}}\right\vert }^{2}+{p}_{f}{\left\vert {{{{\mathcal{F}}}}}_{t}-{{{{\mathcal{F}}}}}_{t}^{{{{\boldsymbol{w}}}}}\right\vert }^{2}\right).$$
(12)

Here \({{{\mathcal{B}}}}\) represents a minibatch, \(| {{{\mathcal{B}}}}|\) is the batch size, and t denotes the index of a training sample. \({E}^{{{{\boldsymbol{w}}}}}\) and \({{{{\mathcal{F}}}}}^{{{{\boldsymbol{w}}}}}\) denote the model outputs, and \(E\) and \({{{\mathcal{F}}}}\) are the corresponding DFT results. We also adopted a scheduler to tune the prefactors pϵ and pf during training so as to better balance the energy and force labels. Virial errors, which are omitted here, can be added to the loss if such labels are available.
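A minimal transcription of Eq. (12) in PyTorch reads as follows; the prefactor scheduling and the optional virial term are omitted, the default prefactor values are illustrative, and the force term is taken per frame as written in Eq. (12), whereas in practice one may also normalize it per atom.

```python
import torch

def dp_loss(pred_energy, pred_forces, ref_energy, ref_forces, p_e=1.0, p_f=100.0):
    """Minibatch loss of Eq. (12), with fixed prefactors for simplicity.

    pred_energy/ref_energy: (B,) per-frame energies
    pred_forces/ref_forces: (B, N, 3) atomic forces
    """
    loss_e = torch.mean((ref_energy - pred_energy) ** 2)
    # squared force residual summed over atoms and components, averaged over the batch
    loss_f = torch.mean(torch.sum((ref_forces - pred_forces) ** 2, dim=(1, 2)))
    return p_e * loss_e + p_f * loss_f
```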

To finetune the pretrained model on a new dataset, we first replace the energy bias in the last layer of the pretrained model with statistics computed from the new dataset, and then fix part of the parameters of the pretrained model and train the remaining ones. For the experiments that follow, we obtained the best performance when only the type-embedding parameters were fixed.
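Schematically, the finetuning recipe can be expressed as follows (a hypothetical sketch; `estimate_energy_bias` and `train` are placeholder helpers, and the parameter-name prefix used to identify the type embedding is an assumption):

```python
def finetune(pretrained_model, new_dataset, optimizer_factory):
    """Finetuning recipe: reset the output energy bias to statistics of the new
    dataset, freeze the type-embedding parameters, and train the rest."""
    # 1. replace the energy bias with per-element statistics of the new data
    pretrained_model.energy_bias = estimate_energy_bias(new_dataset)      # placeholder helper

    # 2. freeze only the type-embedding parameters
    for name, param in pretrained_model.named_parameters():
        param.requires_grad = not name.startswith("type_embed")          # assumed naming

    # 3. continue training on the new labels with the remaining parameters
    trainable = [p for p in pretrained_model.parameters() if p.requires_grad]
    train(pretrained_model, new_dataset, optimizer_factory(trainable))   # placeholder helper
    return pretrained_model
```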