Awesome Materials & Chemistry Datasets
A curated list of the most useful datasets in materials science and chemistry for training machine learning and AI foundation models. This includes experimental, computational, and literature-mined datasets, prioritizing open-access resources and community contributions.
This is an extract of the materials-relevant entries, collected here for convenience. Visit the original page for the full list.
Materials Datasets from Awesome Materials & Chemistry Datasets
| Dataset | Domain | Size | Type | Format |
|---|---|---:|---|---|
| [OMat24 (Meta)](https://huggingface.co/datasets/fairchem/OMAT24) | Inorganic crystals | 110M DFT entries | Computational | JSON/HDF5 |
| [OMC25](https://huggingface.co/facebook/OMC25) | Molecular crystals | >27M structures | Computational | Zarr |
| [Materials Project (LBL)](https://materialsproject.org) | Inorganic crystals | 500k+ compounds | Computational | JSON/API |
| [Open Catalyst 2020 (OC20)](https://opencatalystproject.org) | Catalysis surfaces | 1.2M relaxations | Computational | JSON/HDF5 |
| [AFLOW](https://aflow.org) | Inorganic materials | 3.5M materials | Computational | REST API |
| [OQMD](https://oqmd.org) | Inorganic solids | 1M+ compounds | Computational | SQL/CSV |
| [JARVIS-DFT (NIST)](https://jarvis.nist.gov) | 3D/2D materials | 40k+ entries | Computational | JSON/API |
| [Carolina Materials DB](http://www.carolinamatdb.org) | Hypothetical crystals | 214k structures | Computational | JSON |
| [NOMAD](https://nomad-lab.eu/prod/v1/gui/search/entries/search/entries) | DFT/MD workflows | >19M calculations | Computational | JSON |
| [MatPES](https://matpes.ai) | DFT potential-energy surfaces | ~400,000 structures from 300 K MD simulations | Computational | JSON |
| [C2DB](https://c2db.fysik.dtu.dk) | 2D materials | ~4,000 entries | Computational | JSON/API |
| [CoRE MOF 2019](https://pubs.acs.org/doi/10.1021/acs.jced.9b00835) | Metal-organic frameworks | 14,763 structures | Computational | CIF/JSON |
| [QMOF Database](https://figshare.com/articles/dataset/QMOF_Database/13147324) | Metal-organic frameworks | 20k+ DFT structures | Computational | CIF/JSON |
| [Catalysis-Hub](https://www.catalysis-hub.org) | Surface reactions | >100k energies | Computational | JSON/API |
| [ODAC23](https://fair-chem.github.io/dac/datasets/odac.html) | MOF + CO₂/H₂O adsorption | 38M DFT calculations | Computational | HDF5 |
| [MOFX-DB](https://doi.org/10.1021/acs.jced.2c00583) | Gas adsorption in MOFs | 3M isotherm points | Computational | CSV/HDF5 |
| [LeMat-Bulk](https://huggingface.co/datasets/LeMaterial/LeMat-Bulk) | Inorganic materials, bulk | 6.7M structures / 5.9M materials | Computational | Hugging Face Dataset |
| [LeMat-Traj](https://huggingface.co/datasets/LeMaterial/LeMat-Traj) | Inorganic materials trajectories | 113M structures | Computational | Hugging Face Dataset |
| [NeurIPS Open Polymer Prediction 2025](https://www.kaggle.com/competitions/neurips-open-polymer-prediction-2025/data) | Polymers | ~1,500 test polymers with MD-derived properties | Computational | CSV |
| [Carbon Data](https://github.com/jla-gardner/carbon-data) | Carbon materials | 22.9M atoms; 546 trajectories | Computational | EXTXYZ |
| [MP-ALOE](https://figshare.com/articles/dataset/MP-ALOE_An_r2SCAN_dataset_for_universal_machine_learning_interatomic_potentials/29452190) | Universal MLIPs, 89 elements | ~1M r2SCAN DFT calculations | Computational | JSONL/MACE |
| [Alexandria DB](https://alexandria.icams.rub.de) | Inorganic 1D–3D materials | >5M PBE DFT calculations | Computational | JSON/OPTIMADE/LMDB |
| [Quantum-Chemical Bonding DB / LOBSTER](https://zenodo.org/records/8091844) | Solid-state bonding analysis | 1,520 compounds | Computational | JSON |
| [Matbench v0.1](https://matbench.materialsproject.org/) | Materials ML benchmarks | 13 tasks / 10 datasets | Benchmark / Computational | CSV/HDF5 |
| [Matbench Discovery](https://matbench-discovery.materialsproject.org/data) | Stability, thermal conductivity, structures | Multiple files | Benchmark / Computational | CSV/ZIP |
| [Materials Cloud Archives](https://archive.materialscloud.org) | DFT/MD workflows | 1,000+ datasets | Computational | HDF5/JSON/CIF |
| [MS25](https://zenodo.org/records/10901820) | MLIP benchmark, 6 material systems | Multi-system benchmark suite | Computational / Benchmark | HDF5 |
| [RadonPy Polymer Properties Data](https://github.com/RadonPy/RadonPy/blob/648c9a492808339c9bb7ad2c1137e5a7b07614ca/data/PI1070.csv) | Polymers | ~1,070 MD-calculated properties | Computational | CSV |
| [AQCat25](https://huggingface.co/datasets/SandboxAQ/aqcat25-dataset) | Catalysis | 13.5M frames / 5k materials | Computational | Parquet/ASE DB |
| [Quantum Cluster Database](https://muellergroup.jhu.edu/qcd/) | Nanoclusters | 63,015 clusters | Computational | CSV/JSON |
| [The Cambridge Cluster Database](https://www-wales.ch.cam.ac.uk/CCD.html) | Mixed clusters | Multiple files | Computational | Multiple formats |
| [Battery Electrolyte Solvation/Ionization](https://doi.org/10.5281/zenodo.15252439) | Battery electrolyte molecules | Thousands of molecules | Computational | Not specified |
| [Crystallography Open Database (COD)](https://www.crystallography.net/cod) | Crystal structures | ~525k entries | Experimental | CIF/SMILES |
| [NIST ICSD subset](https://icsd.products.fiz-karlsruhe.de) | Inorganic structures | ~290k structures | Experimental | CIF |
| [CSD / Cambridge Structural Database](https://www.ccdc.cam.ac.uk) | Organic crystals | ~1.3M structures | Experimental | CIF |
| [opXRD](https://doi.org/10.5281/zenodo.14254270) | Crystal structures / XRD | 92,552 entries; 2,179 labeled | Experimental | JSON |
| [MDR SuperCon](https://mdr.nims.go.jp/collections/4c428a0c-d209-4990-ad1f-656d05d1cfe2) | Superconductors | Legacy superconductor database | Mixed | Not specified |
| [HTEM-DB](https://htem.nrel.gov) | Thin-film composition libraries | 140k+ samples | Experimental | JSON/API |
| [OCx24](https://github.com/facebookresearch/fairchem/tree/main/src/fairchem/applications/ocx/data) | Electrocatalyst inks | 572 samples plus DFT | Experimental | CSV |
| [Polymer Genome](https://khazana.gatech.edu/dataset/) | Polymers | 20k polymers | Experimental + Computational | CSV/JSON |
| [CoRE MOF 2024](https://www.ccdc.cam.ac.uk/support-and-resources/downloads/) | Metal-organic frameworks | 40k+ experimental MOFs | Experimental | CIF |
| [Anion Solvation DB](https://doi.org/10.5281/zenodo.13987781) | Anion solvation | ~26k properties | Mixed | CSV |
| [StarryData2](https://github.com/starrydata/starrydata_datasets) | Experimental materials properties | Figshare dump, 2023/2024 | Experimental | CSV/JSON |
| [CRIPT Polymer Data](https://www.criptapp.org) | Polymer synthesis and properties | Growing community database | Mixed | JSON/API |
| [Battery Imaging Library (BIL)](https://www.batteryimaginglibrary.com) | Battery imaging | 80+ scans; >500B voxels | Experimental | Various |
| [MaScQA](https://github.com/abhijeetgangan/MaSTeA) | Materials-science QA | 640 QA pairs | LLM Training | XLSX |
| [SciCode](https://scicode-bench.github.io) | Research coding in physics, math, materials science, biology, and chemistry | 338 subproblems | LLM Training | JSON |
| [MatSci-Instruct / HoneyBee](https://zenodo.org/record/10119842) | Materials science | ~55k verified instructions | LLM Training | JSON |
| [MaCBench](https://macbench.lamalab.org) | Chemistry and materials science | Vision-language tasks | LLM Training | JSON |
| [MatText](https://huggingface.co/datasets/n0w0f/MatText) | Materials property prediction | 2M structures | LLM Training | Hugging Face Dataset |
| [MatScholar Task-Schema QA / MatSci-NLP](https://github.com/BangLab-UdeM-Mila/NLP4MatSci-ACL23) | Materials-science NLP | Tens of thousands of examples | LLM Training | JSON |
| [TextEdge](https://drive.google.com/drive/folders/1YCDBzwjwNRIc1FRkB662G3Y5AOWaokUG?ths=true) | Crystal properties | Crystal text descriptions with propertie
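
To filter the table above by type or format, the rows can be parsed into records. The following is a minimal sketch, not part of the original list: the helper names `parse_row` and `select` are hypothetical, and it assumes every complete row has exactly five pipe-separated cells with a markdown link in the first cell. Header, separator, and incomplete rows are skipped.

```python
import re

# Matches a markdown link that fills the whole cell: [name](url)
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_row(line):
    """Turn one table row into a dict; return None for non-data lines."""
    line = line.strip()
    if not line.startswith("|"):
        return None
    cells = [cell.strip() for cell in line.strip("|").split("|")]
    if len(cells) != 5:
        return None  # incomplete or malformed row
    match = LINK_RE.fullmatch(cells[0])
    if match is None:
        return None  # header or separator row
    name, url = match.groups()
    return {
        "name": name,
        "url": url,
        "domain": cells[1],
        "size": cells[2],
        "type": cells[3],
        "format": cells[4],
    }


def select(lines, type_contains=None, format_contains=None):
    """Return parsed rows whose Type/Format cells contain the given text."""
    rows = [row for row in map(parse_row, lines) if row is not None]
    if type_contains is not None:
        rows = [r for r in rows if type_contains.lower() in r["type"].lower()]
    if format_contains is not None:
        rows = [r for r in rows if format_contains.lower() in r["format"].lower()]
    return rows


# A few rows copied from the table above, used as sample input.
SAMPLE = """\
| Dataset | Domain | Size | Type | Format |
|---|---|---:|---|---|
| [OMat24 (Meta)](https://huggingface.co/datasets/fairchem/OMAT24) | Inorganic crystals | 110M DFT entries | Computational | JSON/HDF5 |
| [Crystallography Open Database (COD)](https://www.crystallography.net/cod) | Crystal structures | ~525k entries | Experimental | CIF/SMILES |
| [Matbench v0.1](https://matbench.materialsproject.org/) | Materials ML benchmarks | 13 tasks / 10 datasets | Benchmark / Computational | CSV/HDF5 |
""".splitlines()

if __name__ == "__main__":
    for row in select(SAMPLE, type_contains="Experimental"):
        print(row["name"], row["url"])
```

Matching is by case-insensitive substring, so `type_contains="Computational"` also returns rows typed "Benchmark / Computational". The Size column is kept as free text because its units differ from row to row (entries, structures, calculations, atoms).
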
Acknowledgement: Ben Blaiszik, Globus Labs researcher working on machine learning and AI in scientific applications.