LeMat-Bulk: Quantum Chemistry Materials Databases (>5.8M)
LeMat-Bulk is a unified dataset that combines Materials Project, OQMD, and Alexandria, containing over 5.3 million PBE-calculated materials. It also includes the largest collections of PBEsol and SCAN functional calculations. The dataset standardizes calculations across the source databases and uses a novel hashing function, the Bonding Algorithm Weisfeiler-Lehman (BAWL), for de-duplication. BAWL is reported to be robust and computationally efficient, outperforming existing fingerprinting techniques.
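To give a rough sense of how hash-based de-duplication over bonding graphs can work, here is a minimal sketch of a generic Weisfeiler-Lehman graph hash in plain Python. It is not the BAWL implementation, which is defined in the cited paper and not reproduced here. The function names (wl_hash, deduplicate), the entry names, and the toy two-atom graphs are all hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict


def wl_hash(labels, edges, iterations=3):
    """Generic Weisfeiler-Lehman hash of a labeled, undirected graph.

    labels: dict mapping node id -> initial label (e.g. an element symbol)
    edges:  iterable of (u, v) pairs (e.g. bonds between sites)
    Illustrative only; this is NOT the BAWL fingerprint itself.
    """
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    current = {node: str(label) for node, label in labels.items()}
    for _ in range(iterations):
        updated = {}
        for node in current:
            # Combine a node's label with the sorted labels of its neighbors.
            neigh = sorted(current[m] for m in neighbors[node])
            signature = current[node] + "|" + ",".join(neigh)
            updated[node] = hashlib.sha256(signature.encode()).hexdigest()[:16]
        current = updated

    # The graph hash depends only on the multiset of final node labels,
    # so it is independent of how the nodes were numbered.
    summary = ",".join(sorted(current.values()))
    return hashlib.sha256(summary.encode()).hexdigest()[:16]


def deduplicate(entries):
    """Keep the first entry seen for each distinct graph hash."""
    seen = {}
    for name, labels, edges in entries:
        seen.setdefault(wl_hash(labels, edges), name)
    return list(seen.values())


# Hypothetical toy entries: a and b are the same graph with different numbering.
entry_a = ("entry_a", {0: "Na", 1: "Cl"}, [(0, 1)])
entry_b = ("entry_b", {5: "Cl", 9: "Na"}, [(9, 5)])
entry_c = ("entry_c", {0: "Na", 1: "Na"}, [(0, 1)])

unique = deduplicate([entry_a, entry_b, entry_c])
```

In this sketch, entry_a and entry_b collapse to a single record because their graphs differ only in node numbering, while entry_c is kept as a distinct structure. A production fingerprint for crystals would need to handle periodicity and bond detection, which this toy example deliberately leaves out.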
Citation: Siron, Martin, Inel Djafar, Ali Ramlaoui, Etienne du Fayette, Amandine Rossello, Edvin Fako, Matthew McDermott et al. "LeMat-Bulk: aggregating, and de-duplicating quantum chemistry materials databases." arXiv preprint arXiv:2511.05178 (2025).