Introduction

The accelerated design and characterization of materials of technological interest has been a rapidly evolving area of research over the last few decades1. Materials design requires approaches spanning a variety of length and time scales2. For atomistic design, the methods employed include computational approaches such as density functional theory, tight-binding, and force-fields, as well as higher-accuracy approaches such as quantum Monte Carlo and quantum computation. A wide range of approaches are employed above the purely atomistic level, such as mesoscale and finite-element methods3. Similarly, experimental characterization approaches include X-ray diffraction, vibroscopy, manometry, scanning electron microscopy, and magnetic susceptibility measurements.

Moreover, data produced from these techniques can be of various types: chemical formulae, atomic/micro-structures, images, spectra, and text documents4,5,6. The data analysis and curation methods add further complexity to benchmarking efforts, which are extremely important7,8,9,10,11,12,13,14,15,16,17,18. For example, more than 70 % of research works have been shown to be non-reproducible19,20,21, and this number could be much higher depending upon the field of investigation. Although there have been significant advances in individual fields, there is an urgent need to establish a large-scale benchmark for systematic, reproducible, transparent, and unbiased scientific development.

Developing such metrology is a highly challenging task, even for one of these methods, let alone the entire galaxy of available methods. Projects and approaches, such as the materials genome and FAIR initiatives1,22, have resulted in several well-curated datasets and benchmarks. These, in turn, have led to several materials informatics applications23,24,25,26. Although electronic structure approaches such as density functional theory (DFT) tend to be more reproducible than other categories16,27, a systematic effort must be made to validate these methods and estimate the error in predictions. Hence, it is highly desirable to have a large-scale benchmarking platform in the materials science field for reproducibility and method validation.

Massive progress in fields such as image recognition/classification (ImageNet28), protein structure prediction (AlphaFold29), and large language modeling (Generative Pretrained Transformers, GPT30) has been possible primarily because of well-defined benchmarks in the respective fields. With regard to AI methods for structure-to-property predictions31, benchmarking efforts have enabled drastic improvements in the accuracy of predicted properties (i.e., moving away from descriptor-based predictions and including graph neural networks in the model architectures to improve accuracy).

For deterministic electronic structure methods such as DFT, extensive benchmarking of software and different DFT approximations (functionals, pseudopotentials, etc.) has led to increased reproducibility and precision in individual results and workflows27,32. Such benchmarks allow a wide community to solve problems collectively and systematically. In addition, since highly accurate models already exist for specific tasks (i.e., energy prediction), more comprehensive evaluations of the models are required so that the performance ranking is not overfitted to one biased data source. We believe that such a universal and large-scale set of benchmarks for materials science will significantly benefit the scientific community.

To date, several benchmarks of individual methods have already been developed. For artificial intelligence (AI) methods, there have been several benchmarks and leaderboards, such as MatBench33. MatBench provides a leaderboard for machine-learned structure-based property predictions of inorganic materials using 13 supervised machine learning tasks (thermodynamic, tensile, optical, thermal, elastic, and electronic properties) from 10 datasets (including DFT and experiment)33. Similar AI benchmarking and leaderboard platforms include MoleculeNet34, OpenCatalystProject35, sGDML36,37, mLEARN38, MatScholar39, and AtomAI40. For electronic structure methods, some of the notable benchmarks include the work by Lejaeghere et al.27, Borlido et al.41, Huber et al.42, Zhang et al.43, Tran et al.44 and several other projects45,46,47,48. Other method benchmarks include phase-field benchmarks by Wheeler et al.49, Lindsay et al.50, and microscopy benchmarks such as by Wei et al.51. A few additional benchmarking studies in materials science include refs. 52,53,54,55,56,57,58,59,60,61,62,63,64. More details on some of these benchmarking efforts are provided in later sections.

The goal of this project is to provide a more comprehensive framework for materials benchmarking than previous works. In particular, most existing efforts: 1) lack the flexibility to readily incorporate new tasks or benchmarks, which is a limitation given the continuous discovery of new materials and quantities in science, 2) are specialized towards a single modality, such as electronic structure, rather than providing a comprehensive framework that can accommodate multiple modalities, 3) offer only a limited set of tasks or properties, 4) are primarily focused on computational methods, overlooking the importance of experimental benchmarking, and 5) make adding contributions to existing platforms rather complex, creating a barrier to entry. In general, there is a need to simplify the process of user contributions to leaderboards to foster broader community engagement.

In this work, we present a user-friendly, comprehensive approach to integrate the benchmarking of computational, experimental, and data-analytics methods. The JARVIS-Leaderboard framework (https://pages.nist.gov/jarvis_leaderboard/) covers a variety of categories: Artificial Intelligence (AI), Electronic Structure (ES), Force-field (FF), Quantum Computation (QC), and Experiments (EXP). It also covers various data types, including atomic structures, spectra, images, and text. This project can be used to: (1) check the state-of-the-art methods in respective fields, (2) add a model contribution to an existing benchmark, (3) add a new benchmark, and (4) compare new ideas and approaches to well-known approaches. To enhance reproducibility, we encourage each contribution to (1) come from a peer-reviewed article with an associated DOI for all contributions, models, and tools, (2) include a run script to exactly reproduce the results (especially for computational tools), and (3) include a metadata file with details such as team name, contact information, computational timing, and software (with software version)/hardware used, in order to enhance transparency.

It is important to note the differences between a typical data repository and a benchmarking platform. Some of the key distinguishing factors between a typical large data repository (such as JARVIS-DFT) and the present leaderboard effort are: 1) the leaderboard contains well-characterized/well-known samples/tasks (i.e., with digital object identifier/peer-reviewed article links) with all the scripts/metadata easily available to reproduce the results, rather than just being a look-up table to find data, and 2) large data repositories usually contain more variation in materials chemistry/structure and less variation in methods, while the leaderboard focuses on a larger number of method comparisons.

For example, JARVIS-DFT contains DFT data for more than 80,000 materials and millions of material properties obtained with a few specific ES methods, and hence there are only a few entries for, say, the electronic bandgap of silicon from different methods, while the leaderboard contains electronic bandgaps for silicon from more than 17 ES methods submitted by various contributors. Similarly, the JARVIS-ALIGNN project contains AI models for more than 80 materials properties/tasks, i.e., just one model for a well-known property such as the formation energy, while there are more than 12 methods for the formation energy task in the leaderboard (as discussed later).

Furthermore, the JARVIS-Leaderboard attempts to bridge multiple categories of methods (AI, ES, FF, QC, EXP) and types of data (single properties, structures, spectra, text, etc.) with the goal of broadening benchmarking efforts across several fields of study. What differentiates the JARVIS-Leaderboard from platforms such as MatBench33 is that MatBench provides a handful of tasks to evaluate ML methods on larger datasets (i.e., on the order of 10^4 entries, most of which are from the Materials Project65). A potential drawback of this approach is that the resulting performance rankings could be biased towards the data distribution of a single source. In contrast, the JARVIS-Leaderboard covers a broader range of datasets and properties and provides a better overview of model performance.

Recently, in the field of machine learning for materials science, there has been a strong focus on performance metrics for newly developed models. This raises the question of whether benchmarking can be detrimental to the development of new methods if those methods cannot immediately outperform the previous state-of-the-art approaches, and whether benchmarking can lead to overfitting or poor generalization66,67.

Therefore, we outline how the leaderboard can also be used to identify and focus on some of the major challenges in different fields, such as: (1) how to evaluate extrapolation capability68? (2) why is it difficult to develop a reasonably good AI model with accuracy similar to electronic structure methods? (3) how can we reduce the computational cost of higher-accuracy electronic structure predictions (such as bandgaps and band offsets)? (4) how do we identify examples of materials that require high-fidelity methods (beyond DFT accuracy)? (5) how can we identify material space where methodological improvements need to be targeted? (6) how can we establish figures of merit for mesoscale and device-scale models such as phase field and technology computer-aided design (TCAD)? (7) how can we make atomistic image analysis quantitative rather than qualitative? and (8) how do we develop and benchmark multi-modal models (such as text, image, video, atomic structures, etc.)69?

The JARVIS-Leaderboard is seamlessly integrated into the existing and well-established NIST-JARVIS infrastructure70,71, which hosts several datasets, tools, applications, and tutorials for materials design, motivated by the materials genome initiative1. The framework is open access to the entire materials science community for progressing the field collectively and systematically. JARVIS (Joint Automated Repository for Various Integrated Simulations)70,71 is a repository designed to automate materials discovery and optimization using classical force-fields, density functional theory, machine learning, and experiments. Nevertheless, the leaderboard is not limited to the NIST-JARVIS infrastructure and can be linked with other external projects as well.

Since its creation in 2017, JARVIS has had over 50,000 users worldwide, over 45 JARVIS-associated articles have been published, and over 80,000 materials currently reside in the database. As these numbers continue to multiply, significant effort on external outreach to the materials science community has been an additional goal of JARVIS, with several events (https://jarvis.nist.gov/events/) such as the Artificial Intelligence for Materials Science (AIMS) and Quantum Matters in Materials Science (QMMS) workshops and hands-on JARVIS-Schools, which have had hundreds of participants throughout the last few years. Based on the level of success and support from the community with regard to the existing JARVIS infrastructure, we believe that the integration of the JARVIS-Leaderboard will have a similar level of engagement and success, with a growing number of contributors from all over the world (in government, academia and industry) and in different sub-fields of materials science.

Results and discussion

Leaderboard overview

On the homepage, information on the number of methods, benchmarks, contributions, and datapoints is provided. A snapshot of the homepage with various categories is shown in Fig. 1a. Clicking on one of the entries (or searching in the 'Search' box), such as “formation_energy_peratom”, opens a new tab with the available contributions. This new tab consists of 1) a description of the benchmark, 2) a plot of the various available contributions (as shown in Fig. 1b), and 3) an explicit table for the plot (as shown in Fig. 1c). For each contribution, links are provided to the submitted data (in .csv.zip format), the reference benchmark data (in a JSON file), a shell script to reproduce the contribution (run.sh file), and a metadata file (metadata.json). The metadata file contains details about the team name, the electronic mail address of the contributor(s), DOI number, software (with software version), hardware, instrument, computational timing, and other relevant details of a benchmark.

Fig. 1: Leaderboard snapshot with an example output for AI-based formation energy per atom model on the JARVIS-DFT (dft_3d) dataset.
figure 1

a Homepage snapshot showing the list of categories and number of available contributions at the time of writing, b an example AI regression model benchmark for formation energy with several contributions. The methods are sorted based on the mean absolute error (MAE) values. Lower MAE values indicate higher accuracy, c explicit table for the plot in panel b. Links to the individual csv.zip (AI-SinglePropertyPrediction-formation_energy_peratom-dft_3d-test-mae.csv.zip), json.zip (dft_3d_formation_energy_peratom.json.zip), shell script (run.sh) and detailed info (metadata.json) files are provided to help enhance reproducibility. Such results plots and tables are available for each benchmark in the leaderboard.

There are several categories for the benchmarks, including AI, ES, QC, FF, and EXP and their combinations. Some example contributions and a summary table are also provided on the webpage to help a user navigate through the project. The summary table breaks down the available information into categories and sub-categories of different methodologies.

JARVIS-Leaderboard is an evolving project, so additions to the project are anticipated, welcome, and easy to make. We show a general flowchart for adding a new benchmark to the leaderboard in Fig. 2. The user can populate the reference dataset (with well-defined data splits) used for a specific benchmark (e.g., for 2D exfoliation energies in the JARVIS-DFT dataset using an AI method: “AI-SinglePropertyPrediction-exfoliation_energy-dft_3d-test”). AI benchmarks have pre-defined training/validation/test identifiers and target data in a corresponding json.zip file, while other methods have only a reference test set for evaluation because they do not require model training. For most benchmarks in the leaderboard, experimental data are used as the reference.

Fig. 2: A flow-chart showing the processes involved in uploading a new contribution to the leaderboard.
figure 2

The jarvis_populate_data.py script generates a benchmark dataset. A user can apply their method, train models, or run experiments on that dataset and prepare a csv.zip file, a metadata.json file, and other files in a new folder in the contributions directory. The contribution can be checked locally by the user with the jarvis_server.py script. The folder can then be uploaded to the user's GitHub account by the automated jarvis_upload.py script, which carries out several GitHub upload steps. The administrators of the JARVIS-Leaderboard at NIST verify the contribution before it becomes part of the leaderboard website.

There is a helper script, jarvis_populate_data.py, to generate a benchmark dataset. A user can apply their method, train models, or run experiments on that dataset and prepare a csv.zip file, a metadata.json file, and, if possible, a conda environment.yaml/Nix/Dockerfile and a run.sh file; these help reproduce the contribution. The files are kept in a folder named after the team and can be uploaded to the user's GitHub account by the automated jarvis_upload.py script. This script automatically forks the parent usnistgov/jarvis_leaderboard repo for the user, adds the team-name folder with its files to that forked repo, runs a few minimal sanity checks on the new contribution, and then makes a pull request to the parent repo. The contribution addition and automated tests are carried out using GitHub actions. The administrators of the JARVIS-Leaderboard at NIST verify the contribution before it becomes part of the leaderboard website.

This project is available on GitHub at: https://github.com/usnistgov/jarvis_leaderboard. The administrators of the JARVIS-Leaderboard at NIST will fully oversee the upload of contributions and benchmarks. A tree structure of the repo is shown in Fig. 3. There are two main directories in the repo: (1) benchmarks (reference) and (2) leaderboard contributions (for various leaderboard entries), as shown by the green highlighted boxes in Fig. 3.

Fig. 3: A tree diagram for directory and file structure in the leaderboard.
figure 3

There are two main directories in the repo: (1) benchmarks (reference) and (2) leaderboard contributions (for various leaderboard entries). In the “benchmarks” directory, there are folders for the AI, ES, QC, FF, and EXP categories. Within them, there are sub-folders for specific sub-categories. In the “contributions” directory, there is a collection of folders that contain .csv.zip and metadata.json files, and optionally a Dockerfile and run.sh file, for the available contributions from each method. The csv.zip file contains entries of identifier (id) and corresponding prediction values as obtained by the corresponding model/method. These test identifiers (such as JVASP-1408) must match the test set ids in the json.zip file in the benchmarks folder for the metric measurements to work.

The “benchmarks” directory has folders for the AI, ES, QC, FF, and EXP categories. Within them, there are sub-folders for specific sub-categories such as (1) SinglePropertyPrediction (where the output of a model/experiment is a single number for an entry), (2) SinglePropertyClass (where the output is class-ids, i.e., 0, 1, …, instead of floating-point values), (3) ImageClass (for multi-class image classification), (4) TextClass (for multi-label text classification), (5) MLFF (machine learning force-field), (6) Spectra (for multi-value data), and (7) EigenSolver (for Hamiltonian simulation). In each of these sub-folders, there are .json.zip files with well-defined reference datasets and available properties, as also available in the JARVIS-Tools package https://pages.nist.gov/jarvis/databases/. To avoid storing large files in the GitHub repo, the actual datasets are part of JARVIS-Tools and are stored in the Figshare repository with specific DOIs and version numbers.
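As a minimal illustration, such a benchmark file can be inspected with a short Python sketch; the file name below follows the convention shown in Fig. 1, while the internal layout (split name mapped to identifier–target pairs) is an assumption made here for illustration only:

```python
import json
import zipfile

# File name taken from the leaderboard convention (see Fig. 1); the internal
# key names ("train", "val", "test") are assumptions for this sketch.
path = "dft_3d_formation_energy_peratom.json.zip"

with zipfile.ZipFile(path) as zf:
    inner = zf.namelist()[0]  # a single JSON file is assumed inside the archive
    benchmark = json.loads(zf.read(inner))

# Assumed layout: split name -> {identifier: target value}
for split in ("train", "val", "test"):
    if split in benchmark:
        print(split, len(benchmark[split]), "entries")
```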

Next, in the “contributions” directory, there is a collection of folders that consist of .csv.zip and metadata.json files, and optionally a Dockerfile and run.sh file. The csv.zip file contains identifier (id) entries and the corresponding prediction values obtained by the model/method. These test identifiers (such as JVASP-1408 in Fig. 3) must match the test set IDs in the json.zip file in the benchmarks folder for the metric measurements to work. Each csv.zip filename must contain six components to place the contribution on the appropriate webpage. The components are the category (such as AI), sub-category (such as ImageClass), property (such as bravais_lattice), dataset name (such as stem_2d_image, as available on the JARVIS-Tools database page), data split (such as test), and metric (such as mae), as illustrated in the sketch below. For entries in the AI category, the data is in train-validation-test splits (using a fixed random number generator). For the current leaderboard format, we report the performance on the test set only. These files can be easily edited with common text editors. Each contribution folder (e.g., alignn-model) consists of one or several csv.zip files, one for each benchmark (such as formation energies, bandgaps, etc.).
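For example, a contribution file for the formation energy benchmark could be assembled as follows (a minimal sketch; the prediction values are placeholders and the exact column header is an assumption):

```python
import zipfile

# Six hyphen-separated filename components: category, sub-category, property,
# dataset, data split, and metric (example filename taken from Fig. 1).
fname = "AI-SinglePropertyPrediction-formation_energy_peratom-dft_3d-test-mae.csv.zip"

# Identifier -> predicted value; placeholder numbers, and the "id,prediction"
# header is an assumed column naming.
predictions = {"JVASP-1002": -0.42, "JVASP-1408": -1.10}

rows = ["id,prediction"] + [f"{i},{v}" for i, v in predictions.items()]
with zipfile.ZipFile(fname, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr(fname.replace(".zip", ""), "\n".join(rows))
```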

Model-specific details are kept in the metadata.json file, with required keys such as model_name, project_url, team_name, and an e-mail address. Users can also keep other data, such as the uncertainty, time taken, and instrument/software/hardware used, in the metadata file. For computational models, the run.sh script can be used to reproduce the contribution completely as a single command-line script or job-submission script. If a method requires additional steps or details beyond a simple command-line script, a user can upload a README file containing the additional details. For enhanced reproducibility, we also optionally allow users to include a Dockerfile and an IPython/Google Colab notebook for each benchmark. These notebooks can be used to run the contributions in the Google cloud without downloading anything locally.
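A minimal metadata.json could be generated as follows (the key names follow those mentioned above; the exact e-mail key name, the URL, and the optional fields are illustrative assumptions):

```python
import json

# Required keys named in the text; optional fields and exact key spellings are
# assumptions for this sketch.
metadata = {
    "model_name": "alignn_model",
    "project_url": "https://example.com/my-model",
    "team_name": "MyTeam",
    "email": "contributor@example.com",
    "software": "my_code v1.0",     # optional: software/hardware details
    "time_taken_seconds": 3600,     # optional: computational timing
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```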

In addition, there is a “docs” directory in the JARVIS-Leaderboard. The docs folder has a directory structure similar to the benchmarks folder, with category names (AI, ES, etc.) and sub-categories (such as SinglePropertyPrediction, ImageClass, etc.) containing markdown (.md) files that are automatically converted into the corresponding HTML pages for the website. For each benchmark (i.e., json.zip file), a corresponding docs entry (i.e., md file) should be present. A new benchmark must be associated with a peer-reviewed article and a DOI, in order to establish trust in the reference benchmark data. A new benchmark must also be verified by the JARVIS-Leaderboard administrators.

As mentioned above, there already exist several other materials science-specific benchmarks. We compare some of these benchmarks in Table 1 based on the categories that are included. We find that there is no single, large-scale benchmark encompassing the various fields as in the JARVIS-Leaderboard. Also, the data format, metadata, and website for these different leaderboards vary significantly. Hence, having a uniform way to compare different methods would greatly help the materials community.

Table 1 Comparison of benchmark infrastructure available for materials design methods for several categories

Benchmarks

The benchmarks consist of experimental data, density functional theory results, or numerical solutions that are well-known and have already been published in peer-reviewed articles or books. A benchmark should be considered the “ground truth” for a particular task. Therefore, it is mandatory to have a digital object identifier (DOI) from a peer-reviewed article for each benchmark. There can be multiple contributions from different models or experiments for a benchmark, e.g., contributions from various DFT functionals in predicting the electronic bandgap of silicon with respect to experimental data. Typically, for electronic structure (ES) method-based contributions, the benchmarks are experimental data; for artificial intelligence (AI) methods, they are the test split; for force-field (FF) methods, they are electronic structure data; for quantum computation (QC), they are analytical results; and for experiments (EXP), they are other experiments. Currently, we have more than 270 benchmarks in the leaderboard. The JARVIS-Leaderboard's flexible and dynamic nature also allows the addition of new benchmarks.

Each entry in a benchmark dataset has a unique identifier. Most of these datasets are already integrated into the JARVIS-Tools databases page (but are not limited to it), have an associated JARVIS ID number (JID), and are backed up in Figshare, Google Drive, and NIST-internal storage systems. The number of entries can vary from a few (especially for experimental and high-accuracy computational methods, where generating a very large dataset is not feasible in terms of time and resources) to hundreds of thousands of entries in a dataset.

An overview of the datasets can be found in Fig. 4. Considering all possible entries in the datasets, we have close to 7 million datapoints. For example, an atomic structure can have multiple calculated properties, such as bandgaps and formation energies, among others. We find that the JARVIS-DFT-3D dataset has the largest number of entries. Considering unique systems, the distribution is shown in Fig. 4b. In this case, qe-tb (the fitting dataset for ThreeBodyTB.jl72) is one of the largest datasets available in the leaderboard. Note that these datasets contain a variety of data modalities, such as atomic structures, images, spectra, and text. In Fig. 5, we show the fractional distribution of periodic table elements in the entire dataset. We find that the most common elements are C, N, O, and Cu, which is similar to the natural abundance of these elements.

Fig. 4: Distribution of data in each dataset.
figure 4

a all entries in the leaderboard, b entries with unique identifiers. Note that one identifier (such as JVASP-1002 for silicon) can have multiple properties (such as bandgap, bulk modulus, etc.). A script to generate this figure is also provided on the leaderboard website, as the leaderboard is continuously evolving.

Fig. 5: Periodic table element distribution for entries in all the datasets.
figure 5

This is calculated by counting all element-specific entries and normalizing by the total number of entries, i.e., these are percentage probabilities.

Experimental results are uploaded as benchmarks (i.e., what is regarded as the reference). In the absence of experimental data, high-fidelity computational methods can be used as a reference. If multiple experimental measurements are available in the literature, each can be added as a separate benchmark (i.e., different json.zip files to distinguish one benchmark from another), and users can submit contributions for each of them. As the materials science field progresses, certain experimental data may need to be revisited (i.e., more accurate measurements become available or results are reported that contradict previous experimental data). In response, separate reference (experimental) benchmarks can be added, and users will be able to plot and compare the evolution of these benchmarks over time.

In addition, leaderboard users can raise an issue on GitHub pertaining to the reference benchmarks. The administrators will also upload a README file containing additional information about the experiments conducted, including the associated DOI and the experimental conditions, as well as details of any additional experiments on the same material/property in the literature. The experimental conditions described in the README file can be important when comparing the reference benchmark to calculated results, which may correspond to different conditions than the experiment (i.e., the bandgap of a material is never measured at 0 K, as DFT predicts).

Contributions to the leaderboard in the form of user-submitted experimental data can be compared with previous experiments, electronic structure methods, or other numerical results. ES-based contributions are benchmarked against experimental results and can be compared with other ES methods. QC data can be compared with classical computation data or exact analytical results. For FF, contributions can be compared to DFT (or other ES data) or high-level interatomic potential benchmark suites (specifically for MLFFs)38. For AI, a test dataset is used; unlike the other categories, AI methods can have both “train” and “test” sets, while the others have only “test” sets in the corresponding dataset. For AI methods, if the “train” dataset is not provided and only the “test” set is given, the benchmark can be used to check extrapolation behavior, as in the vacancy formation energy benchmarks.

Analysis of benchmarks

Presently, the leaderboard has 5 categories, 10 sub-categories, 152 methods, 274 benchmarks, 1281 contributions, and 8,714,228 datapoints. In this section, we show a few of the hundreds of example analyses that can be carried out using the available benchmarks and contributions. In Fig. 6, we show the MAE of the AI-computed formation energy and the ES-computed bandgap of Si for a variety of contributions in the leaderboard. In Fig. 6(a) we compare 12 AI models (each AI model used a well-defined 80:10:10 split of the JARVIS-DFT 3D dataset for training, validation, and testing, respectively) and find that kgcnn_coGN73 has the highest accuracy/lowest error, followed by the Potnet74, Matformer75, and ALIGNN76,77 models. This can be attributed to the fact that as we include more structural information and use deep-learning methods rather than descriptor-based methods, the accuracy improves.
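For reference, a reproducible 80:10:10 split with a fixed random seed can be generated along the following lines (a sketch only; the actual seed and shuffling used for the leaderboard splits may differ):

```python
import numpy as np

def split_ids(ids, seed=123, frac_train=0.8, frac_val=0.1):
    """Shuffle identifiers with a fixed seed and return an 80:10:10
    train/validation/test split (illustrative; not the exact leaderboard recipe)."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(ids)
    perm = rng.permutation(len(ids))
    n_train = int(frac_train * len(ids))
    n_val = int(frac_val * len(ids))
    return (ids[perm[:n_train]],
            ids[perm[n_train:n_train + n_val]],
            ids[perm[n_train + n_val:]])

train, val, test = split_ids([f"JVASP-{i}" for i in range(100)])
```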

Fig. 6: Examples of benchmark categories available on the JARVIS-Leaderboard.
figure 6

a Artificial intelligence (AI) formation energy for the test set with 5572 materials in the JARVIS-DFT 3D dataset, b electronic structure (ES) Si (JARVIS-DFT ID: JVASP-1002) bandgap, c classical force-field (FF) based Voigt bulk modulus of Si, and d machine learning force-field (MLFF) based forces for Si. We provide Jupyter/Google Colab notebooks to easily plot such comparisons for all available benchmarks. Similar analysis figures for all the available benchmarks are provided in the Supplementary Information (Supplementary Figs. 1–298). As a note, these plots are a current snapshot of the leaderboard, and it is possible that new and more accurate models will be developed and added in the future.

Similarly, in Fig. 6(b) we compare the bandgap of Si using several methods and find that GLLB-sc78 calculated with GPAW79 yields the lowest error, followed by the G0W080 (VASP81,82), GW080 (VASP81,82), TBmBJ83,84 (VASP), and DMC85 (QMCPACK86) methods. This can be attributed to the inclusion of the discontinuity potential (GLLB-sc78) or kinetic energy density (TBmBJ83,84) in the density functional, or the incorporation of many-body physics (G0W080, GW080, DMC85) into the methodology, which can lead to improved accuracy for bandgap prediction. Also, data from similar methods, such as PBE87 from the Open Quantum Materials Database (OQMD)88,89, AFLOW90, and the Materials Project65, compare well with each other.

In Fig. 6(c) we compare how several classical FFs compute the Voigt bulk modulus of Si. In Fig. 6(d) we compare several MLFF models for the forces of Si. We compare various pretrained MLFFs and other MLFFs that we specifically trained on the MLEARN38 dataset (PBE-based DFT data). We see that ALIGNN-FF91 and MatGL92 perform similarly for the prediction of forces. Fig. 6(d) thus provides a comprehensive comparison of MLFFs that are trained and tested on the same dataset and pretrained models that were trained elsewhere. The comparisons are presented in tabular form for all the benchmarks on the leaderboard website. We have provided tools and notebooks in the leaderboard GitHub repository that can be used for making such plots for all the available benchmarks and contributions. A collection of such figures for method comparison is available in the Supplementary Information (Supplementary Figs. 1–298). We have also added interactive plots for such comparisons on the website. These tools can aid in identifying examples of materials that require high-fidelity methods beyond the accuracy of DFT in order to understand their underlying properties. In addition, these tools can be used to validate electronic structure methods and provide insight for error estimation.
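Such comparison plots essentially amount to collecting per-method MAE values and sorting them; a minimal sketch (with placeholder method names and numbers rather than actual leaderboard values) is:

```python
import matplotlib.pyplot as plt

# MAE per method; the names and values below are placeholders, not leaderboard data.
results = {"method_A": 0.075, "method_B": 0.045, "method_C": 0.033}

names, maes = zip(*sorted(results.items(), key=lambda kv: kv[1]))
plt.bar(names, maes)
plt.ylabel("MAE (eV/atom)")
plt.title("Formation energy benchmark (illustrative)")
plt.tight_layout()
plt.savefig("benchmark_comparison.png", dpi=150)
```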

The leaderboard has a large number of benchmarks and can enable a more comprehensive comparison of different methods, better revealing their respective advantages and limitations. For instance, neural networks outperform descriptor-based models by a large margin in all of the 10 regression tasks in the latest MatBench33 leaderboard. To check whether this is also the case for the 44 regression benchmarks in the current JARVIS-Leaderboard, we compare the performance of the best descriptor-based model to that of the best neural network. As shown in Fig. 7, the best neural network outperforms the best descriptor-based model in 34 tasks, but only 14 out of 44 (32 %) tasks show a performance difference of more than 20 %. This indicates that descriptor-based models are still competitive with neural networks, especially considering their better interpretability and orders-of-magnitude lower training cost66,67. Notably, the best descriptor-based model is found to outperform the best neural network in 10 tasks, including those with 10^4–10^5 training data points, opening up interesting questions and potential directions for further model improvement. For instance, the inferior performance of neural networks in the regression tasks for the heat capacity and hMOF data may be related to the recently revealed inability of graph neural networks to capture periodicity93.
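The relative performance used in Fig. 7 is simply the normalized MAE difference; as a sketch:

```python
def relative_performance(mae_descriptor, mae_neural_net):
    """Relative performance as defined for Fig. 7: the MAE difference between the
    best descriptor-based model and the best neural network, divided by the MAE
    of the best neural network."""
    return (mae_descriptor - mae_neural_net) / mae_neural_net

# Positive values: the neural network wins; negative values: the descriptor model wins.
print(relative_performance(0.12, 0.10))  # 0.2, i.e., a 20 % difference (illustrative numbers)
```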

Fig. 7: Relative performance computed as the ratio of the MAE difference between the best descriptor-based model and the best neural network to the MAE of the best neural network in the AI regression benchmarks.
figure 7

The benchmark name and the corresponding best-performing neural network are indicated on the left and right y axes, respectively. For all the considered AI benchmarks, the best descriptor-based model is the tree-based model using Magpie185 and Voronoi-tessellation201 features. As a disclaimer, these plots are a current snapshot of the leaderboard, and it is possible that new and more accurate models will be developed in the future.

Analysis of error metrics

Although a metric such as the MAE can be useful for comparing methods on a specific benchmark, it is difficult to compare across different benchmarks, since MAE values can differ substantially. Hence, we use the ratio of the mean absolute deviation (MAD, computed with respect to the average value of the training data as a baseline/random-guess model) to the MAE for both the AI and ES single-property-prediction categories. The MAD acts as a baseline/random-guessing model for the benchmark, and contributed models should have MAE values lower than the MAD. We show the MAD/MAE ratios for AI and ES benchmarks in Fig. 8. We find that the MAD/MAE values range from 2 to 50. MAD/MAE values close to 1 suggest low predictive power. We observe that quantum properties such as the bandgap have lower MAD/MAE than classical quantities (quantities that do not require quantum mechanical simulations) such as the total energy or bulk modulus. Interestingly, such trends for classical vs. quantum quantities are observed for both the AI and ES approaches.
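A sketch of the MAD/MAE computation (following the definition above, with the training-set mean as the baseline prediction) is:

```python
import numpy as np

def mad_over_mae(y_train, y_test, y_pred):
    """MAD/MAE ratio: MAD is the error of a baseline that always predicts the
    training-set mean; ratios near 1 indicate little predictive power."""
    y_train, y_test, y_pred = map(np.asarray, (y_train, y_test, y_pred))
    mad = np.mean(np.abs(y_test - y_train.mean()))
    mae = np.mean(np.abs(y_test - y_pred))
    return mad / mae
```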

Fig. 8: Performance comparisons in terms of mean absolute deviation (MAD) to mean absolute error (MAE) ratios as uniform criteria.
figure 8

a AI and b electronic structure methods. The MAD:MAE ratio serves as a uniform criterion for comparing the performance of models.

Interactive view of benchmarks and contributions

In addition to the bar plots shown in Figs. 6 and 8, the raw data available in the benchmarks and contributions can be presented in various other forms, such as scatter plots, bandstructures, adsorption isotherms, and diffraction spectra. In Fig. 9, we show example comparisons of different methods for the AI, ES, QC, and EXP categories, including (a) a formation energy per atom model using AI, (b) bulk modulus predictions using ES, (c) the electronic bandstructure of Al using QC with different quantum circuits94, and (d) CO2 capture for a zeolite measured at several labs in a round-robin fashion95. In Fig. 9a, we find that the formation energy is one of the easiest quantities for which to train AI models, and even simple chemistry-only models can perform reasonably well (i.e., cfid_chem). Including more structural features (such as bond angles and dihedral angles) and using deep learning models (such as graph neural networks vs. descriptor-based models) further improves accuracy. Similarly, for the ES example of predicting the bulk modulus, we find that, irrespective of the DFT-based method used, the results are in relatively close agreement with experimental bulk modulus data, as shown in Fig. 9b. In Fig. 9c, we find that the selection of a quantum circuit is critically important for predicting electronic band structures well. Here, we used six different quantum circuits94 and found the SU(2)96 circuit to compare well with classical computer-based electronic bandstructures. This can be attributed to the various entanglements captured in the SU(2)96 circuit that may be missing in other circuits. Finally, for the experimental inter-laboratory/round-robin measurements of the zeolite CO2 isotherm, we find excellent agreement across different labs95.

Fig. 9: Example results for AI, ES, QC and EXP results.
figure 9

a formation energy per atom model using AI for the JARVIS-DFT 3D dataset with 5572 materials in the test set, b bulk modulus predictions using ES methods for 21 materials, c electronic bandstructure of aluminum using QC methods with different quantum circuits on a coarse k-point mesh, d CO2 capture for zeolite (ZSM-5) at several labs in an inter-laboratory/round-robin fashion.

Methods

The JARVIS-Leaderboard aims to provide a comprehensive framework covering a variety of length and time-scale approaches2 to enable realistic materials design. In this section, we provide a brief overview of the methods that are currently available in the leaderboard. In this work we use the terms categories, sub-categories, methods, benchmarks, and contributions often, so we define them as follows.

Currently, there are five main “categories” in the leaderboard: Artificial Intelligence (AI), Electronic Structure (ES), Force-field (FF), Quantum Computation (QC), and Experiments (EXP). Each category is divided into “sub-categories”, a list of which is provided on the website. These sub-categories include single-property prediction, single-property classification, atomic force prediction, text classification, text-token classification, text generation, image classification, image segmentation, image generation, spectra prediction, and eigensolver. These sub-categories are highly flexible, and new ones can be easily added. “Benchmarks” are the reference data (in the form of a json.zip file, discussed later) used to calculate performance metrics for each specific contribution. “Methods” are a set of precise specifications for evaluation against a benchmark. For example, within the ES category, density functional theory (DFT) performed with the specifications of the Vienna Ab initio Simulation Package (VASP)81,82, the Perdew-Burke-Ernzerhof (PBE)87 functional, and PAW81,82 pseudopotentials (VASP-PBE-PAW) is a method. Similarly, within the AI category, a descriptor/feature-based model with the specifications of MatMiner97 chemical features and the LightGBM98 software is a method. “Contributions” are individual data (in the form of csv.zip files) for each benchmark computed with a specific method. Each contribution filename consists of six components: category (e.g., AI), sub-category (e.g., SinglePropertyPrediction), property (e.g., formation energy), dataset (e.g., dft_3d), data split (e.g., test), and metric (e.g., mae).

Electronic structure

Electronic structure approaches cover short length scales and short time scales with high fidelity. There are a variety of ES methodologies, such as tight-binding72,99,100, density functional theory (DFT)101, quantum Monte Carlo85, dynamical mean field theory102, and many-body perturbation theory (Green's function with screened Coulomb potential, GW methods)80. For each of these methodologies, a number of specifications are required to completely describe a method, including the exact software, exchange-correlation functional, pseudopotential, and other relevant parameters. Example methods used in this work are given in Table 2.

Table 2 Summary of current benchmark categories and methods available in the JARVIS-Leaderboard at the time of writing

Each method in the ES category can have a variety of contributions. For example, using a specific method, one can calculate various properties such as bandgaps, formation energies, bulk moduli, solar cell efficiencies, and superconducting transition temperatures, as well as spectral quantities such as dielectric functions. While more than 400 approximate exchange-correlation functionals have been proposed in the DFT literature103, we currently have OptB88vdW104, Opt86BvdW105, LDA106, PBE87, PBEsol107, GLLB-sc78, TBmBJ83,84, SCAN108, r2SCAN109, and HSE06110 in the leaderboard. We use converged k-points and cut-offs as available in the JARVIS-DFT database111. We have used the Vienna Ab initio Simulation Package (VASP)81,82, ABINIT112,113,114, GPAW79, and Quantum Espresso (QE)115 as DFT software packages, but other packages can easily be added as well. In addition, we use VASP81,82 to perform GW calculations, including “single-shot” G0W0 and self-consistent GW0 methods80. Other ES approaches include tight-binding (TB)99 and quantum Monte Carlo (QMC)85. For TB, we use the recently developed ThreeBodyTB.jl code72 along with the Wannier90116 code, while the QMCPACK86 code is used for diffusion Monte Carlo (DMC)85 calculations.

Force-field

Force fields can be used in molecular dynamics and Monte Carlo simulations to study larger time and length scales compared to electronic structure methods. Traditional force fields are developed for specific chemical systems and applications and may not be transferable to other uses. It is important to check the validity of an FF before using it in a particular application. Moreover, the development of FFs is a cumbersome task. Examples of typical FFs include embedded-atom method (EAM) potentials117 (i.e., Al099.eam.alloy for the aluminum system118), Lennard-Jones (LJ)119 for 2D liquids, reactive empirical bond order (REBO)120 for Si, and classical, atomistic force fields for biomolecular systems121,122. Recently, machine learning force fields (MLFFs)123,124,125,126,127,128 have become popular because of their higher accuracy and ease of development (such as SNAP129 FFs). Nevertheless, early generations of MLFFs were also developed for specific types of chemistry and applications. Very recently, several MLFFs have been developed that can be used to simulate any combination of periodic table elements. Some of these FFs include M3GNET92, ALIGNN-FF91, and CHGNet130. In the leaderboard, we include benchmarks for energies, forces, and stress tensors for both specific systems and universal datasets.

Traditional FFs are available in LAMMPS131, while MLFFs are integrated into the Atomic Simulation Environment (ASE)132 package. Some of these MLFFs are now available in LAMMPS and other large-scale MD codes. In addition to static quantities, FFs can be used for Monte Carlo simulations, such as CO2 adsorption in metal-organic frameworks (MOFs)133 using the RASPA134 code. In addition to energy, force, and stress, we also have FF benchmarks for classical properties such as the bulk modulus. For biomolecular systems, GROMACS135 is commonly used, and we present here free energy differences and conformational state population benchmarks for three model peptides136,137,138.
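As an illustration of how energies and forces enter such FF benchmarks, the sketch below uses ASE with its built-in EMT calculator as a stand-in (an actual contribution would attach an EAM or MLFF calculator in the same way and compare the predictions against the reference data):

```python
from ase.build import bulk
from ase.calculators.emt import EMT  # simple built-in calculator used as a stand-in

# Build a small aluminum supercell and displace the atoms so the forces are non-zero.
atoms = bulk("Al", "fcc", a=4.05).repeat((2, 2, 2))
atoms.rattle(stdev=0.05, seed=0)

# An EAM or MLFF calculator (e.g., via LAMMPS or an MLFF package) would be
# attached in exactly the same way.
atoms.calc = EMT()
energy = atoms.get_potential_energy()   # eV
forces = atoms.get_forces()             # eV/Angstrom, shape (n_atoms, 3)
print(energy, forces.shape)
```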

Artificial intelligence

Recently, artificial intelligence methods have become popular for materials prediction across all length and time scales. We currently have benchmarks for four types of data used as input to the AI models: (1) atomic structures, (2) spectra, (3) images, and (4) text. AI techniques can be used for both forward prediction and inverse design. For atomic structure datasets, we use DFT datasets such as JARVIS-DFT70,71, the Materials Project (MP)65, the tight-binding three-body dataset (TB3)72, and Quantum-Machine 9 (QM9)139,140. For spectral data, we use either DFT-based spectra of, for example, the electron or phonon density of states (DOS) and Eliashberg functions, or numerical XRD spectra. For images, we have simulated and experimental scanning transmission electron microscope (STEM) and scanning tunneling microscopy (STM) images for 2D materials. For text data, we have used the publicly available arXiv dataset.

Currently, we have feature-based/tabular models (such as RandomForest141, gradient boosting141, and linear regression141), graph-based models (such as ALIGNN76,77, SchNet142, CGCNN143, M3GNET92, AtomVision144, and ChemNLP145), as well as transformers (such as OPT146, GPT30, and T5147). These models use popular AI code bases, including PyTorch148, scikit-learn141, TensorFlow149, LightGBM98, JAX150, and HuggingFace151. They are used for a variety of targets, such as formation energies, electronic bandgaps, phonon spectra, forces, and text data.
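For instance, a minimal feature-based baseline with scikit-learn (with random numbers standing in for the chemistry/structure descriptors; not an actual leaderboard model) looks like:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Random numbers stand in for descriptor features and target values.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 20)), rng.normal(size=500)
X_train, X_test, y_train, y_test = X[:450], X[450:], y[:450], y[450:]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```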

Quantum computation

Quantum chemistry is one of the most promising applications of quantum computation152. Quantum computers with relatively few logical qubits can potentially exceed the performance of much larger classical computers because the size of the Hilbert space increases exponentially with the number of electrons in the system. Predicting the energy levels of a Hamiltonian is a typical and fundamentally important problem in quantum chemistry. We use Hamiltonian simulations with quantum algorithms and compare them with classical solvers. Determining an appropriate quantum circuit for a specific QC problem is a challenging task. For example, we use the tight-binding Hamiltonians for electrons and phonons in JARVIS-DFT and evaluate the electron bandstructures using quantum algorithms (such as the variational quantum eigensolver (VQE)153 and variational quantum deflation (VQD)154) with different quantum circuits (such as the PauliTwoDesign96 and SU(2)96 circuits). We primarily use the Qiskit96 software in this work through the JARVIS-Tools/AtomQC94 interface, but other packages such as Tequila155, Cirq156, and PennyLane157,158 can also be easily integrated. In addition to studying the algorithm and circuit architecture dependence, the leaderboard can be used to study the noise levels of quantum circuits across different quantum computers, which is a key issue hindering the commercialization of quantum computers. Currently, we use only statevector simulators for the quantum algorithms available in the Qiskit96 library.
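As a minimal illustration of this VQE-style workflow on a statevector simulator (a toy two-qubit Hamiltonian with an EfficientSU2 ansatz, not one of the JARVIS tight-binding Hamiltonians):

```python
import numpy as np
from scipy.optimize import minimize
from qiskit.circuit.library import EfficientSU2
from qiskit.quantum_info import SparsePauliOp, Statevector

# Toy two-qubit Hamiltonian (illustrative only) and an SU(2)-type ansatz circuit.
hamiltonian = SparsePauliOp.from_list([("ZZ", 1.0), ("XI", 0.5), ("IX", 0.5)])
ansatz = EfficientSU2(2, reps=1)

def energy(params):
    # Bind the circuit parameters and evaluate <psi|H|psi> on a statevector.
    state = Statevector(ansatz.assign_parameters(params))
    return float(np.real(state.expectation_value(hamiltonian)))

rng = np.random.default_rng(0)
x0 = rng.uniform(0, 2 * np.pi, ansatz.num_parameters)
result = minimize(energy, x0, method="COBYLA")

exact = np.min(np.linalg.eigvalsh(hamiltonian.to_matrix()))
print("variational:", result.fun, "exact:", exact)
```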

Experiments

Although experimental results for material properties and spectra are referenced for comparison with computational methods (within the JARVIS-Leaderboard and other leaderboards such as MatBench33), we dedicate a portion of the JARVIS-Leaderboard to experimental benchmarking. Benchmarking experiments essentially boils down to comparing different experiments for the same desired result(s). A systematic way to perform this benchmarking is through round-robin testing159, i.e., an inter-laboratory test performed independently several times, which can involve multiple scientists and a variety of methods and equipment. This approach has been applied successfully for a range of materials science applications95,160,161,162,163, but many more such experiments are still needed. Specifically, in the JARVIS-Leaderboard, we include experimental round-robin results for manometric measurements of CO2 adsorption95. It is important to note that the experimental results included in the leaderboard are for well-characterized materials with well-defined properties and phenomena that can be easily reproduced (in contrast to replication attempts of variable experiments, such as the recent attempt to synthesize room-temperature superconductors164,165,166,167). Some of the experiments we used for benchmarking purposes are XRD, magnetometry, vibroscopy, scanning electron microscopy (SEM), and transmission electron microscopy (TEM). We purchase the samples from industrial vendors with available identifiers such as CAS numbers. We also carried out XRD for MgB2 (a superconducting material) to verify its crystal structure before carrying out magnetometry measurements to determine the transition temperature; this measurement was compared with numerical XRD data. Magnetometry measurements for superconductors were also conducted to compare their superconducting transition temperatures with predicted or experimentally available values168. Stress-strain measurements were performed on Kevlar for failure analysis168. Several instruments, such as the Bruker D8, Titan, Quantum Design PPMS, and FAVIMAT, are currently represented in the leaderboard.

Metrics used

We use several metrics in the leaderboard, depending on the “sub-categories” mentioned above. We use the mean absolute error (MAE), accuracy (acc), multi-mae (L1 norm of multi-dimensional data), and recall-oriented understudy for gisting evaluation (ROUGE) for the single-property prediction, single-property classification, spectra/eigensolver/atomic-forces, and text-generation/text-summary sub-categories, respectively. As users contribute their data for comparison against the reference data (benchmarks), other complementary metrics (such as those available in the sklearn.metrics library) can easily be calculated, since the raw contribution data is also made available through the website. For the sake of readability and ease of use, we primarily employ the metrics mentioned above. For single-property prediction, there is a single scalar value per id in the csv.zip file, with the id and prediction separated by a comma. For spectra, force prediction, and other multi-value quantities (i.e., with multiple prediction values per id), we concatenate the array and separate the values by semicolons (to avoid the comma convention in csv files). The benchmark data are stored in a similar format. We provide tools to convert these csv.zip files into json or other file formats if needed. We also provide Jupyter/Colab notebooks to visualize the data. In addition, we plan to eventually add metrics for timing, uncertainty, development cost, and other details.
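A sketch of these two error metrics, following the csv conventions above (the exact normalization of multi-mae is assumed here to be a mean over the concatenated values):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error for single-property (one scalar per id) contributions."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def multi_mae(true_str, pred_str):
    """L1-type error for multi-value entries stored as semicolon-separated strings."""
    t = np.array([float(x) for x in true_str.split(";")])
    p = np.array([float(x) for x in pred_str.split(";")])
    return float(np.mean(np.abs(t - p)))

print(mae([1.0, 2.0], [1.1, 1.8]))               # 0.15
print(multi_mae("0.0;1.0;2.0", "0.1;0.9;2.2"))   # ~0.13
```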