Celebrating one year of LeMaterial: How open collaboration is accelerating materials discovery
In December 2024, we announced the launch of LeMaterial, an open-source initiative led by Entalpic in collaboration with Hugging Face, designed to accelerate materials discovery by making high-quality datasets, benchmarks, models, and tools openly accessible to the AI for Science community.
A year later, LeMaterial has grown from a first dataset into a broader collaborative effort. This article looks back at LeMaterial’s first year, including the collaborations that made it possible and the contributions released so far.
LeMaterial's first year: building collaboration for AI4Science
LeMaterial’s main focus has been coordinating researchers across institutions to develop shared datasets and benchmarks for AI in materials science. This required significant effort but created the foundation for a community-driven ecosystem.
Over 12 months, LeMaterial provided open, standardized resources that make it easier to compare, train, and evaluate models. This collective effort brings together contributors from leading institutions, including EPFL, Meta, UCSB, University of Cambridge, Mila, Imperial College London, ETH Zürich, IIT Delhi, MIT, IBM, the AI Alliance, as well as the Hugging Face open-science community. By improving data consistency and sharing practices, LeMaterial supports both academic and industrial research, helping reduce duplication of effort and promoting transparent, interoperable development across the field.
In less than a year, more than 100 people have engaged in LeMaterial’s activities, with over 2,000 dataset downloads and four papers (LeMat-Bulk, LeMat-Traj, LeMat-Synth, and LeMat-GenBench) accepted at major AI4Mat venues (ICLR and NeurIPS). Together, these releases give the community access to harmonized crystal structure data, large-scale trajectory data, structured material synthesis data, and a unified benchmark for generative models of crystalline materials: concrete building blocks that now exist in one place. We briefly summarize these contributions below.
LeMat-Bulk: Harmonized crystal structures
The materials data landscape is fragmented across multiple databases, such as Materials Project, Alexandria, and OQMD. Each uses different formats, naming conventions, and parameters.
LeMaterial addresses this challenge by building shared, open resources for AI in materials science. Our first dataset, LeMat-Bulk, unifies, cleans, and standardizes these major open databases into a single format covering 6.7 million material entries.
To make deduplication efficient, we introduced BAWL, a well-benchmarked hashing function that serves as a digital fingerprint for crystal structures. This allowed us to identify more than 340,000 duplicates across existing repositories, cleaning the data while maintaining scientific precision. A second version of this dataset, called LeMat-BulkUnique, retains only the most stable structure from each duplicate set, providing a clean basis for model training [1].
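To illustrate the idea behind keeping one structure per duplicate set, here is a minimal sketch. It does not implement BAWL: it assumes each entry already carries a precomputed fingerprint string, and it assumes that "most stable" can be approximated by the lowest energy per atom. The `Entry` class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    immutable_id: str
    fingerprint: str        # stand-in for a structure hash (hypothetical field)
    energy_per_atom: float  # eV/atom, used here as an assumed stability proxy


def keep_most_stable(entries):
    """Keep the lowest-energy entry for each fingerprint."""
    best = {}
    for entry in entries:
        current = best.get(entry.fingerprint)
        if current is None or entry.energy_per_atom < current.energy_per_atom:
            best[entry.fingerprint] = entry
    return list(best.values())


# Made-up entries: the first two share a fingerprint, so they count as duplicates.
entries = [
    Entry("mp-1", "hashA", -3.20),
    Entry("oqmd-7", "hashA", -3.25),
    Entry("agm-9", "hashB", -1.10),
]
unique = keep_most_stable(entries)
unique_ids = [entry.immutable_id for entry in unique]
```

Because entries are grouped by a hash lookup, each one is visited once, which is what makes fingerprint-based deduplication practical at the scale of millions of entries.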
LeMat-Traj: Predicting how materials behave
While LeMat-Bulk focuses on static properties, LeMat-Traj captures how materials move and evolve. This dataset aggregates over 120 million atomic configurations from the same three databases as LeMat-Bulk. These relaxation trajectories are essential for training machine learning interatomic potentials (MLIPs) to accurately predict energies and forces, helping researchers model materials behavior with greater precision.
LeMat-Traj was built using a new open-source library developed for this purpose, LeMaterial-Fetcher. The library automatically fetches, checks, and harmonizes data from multiple sources, ensuring every entry meets quality and consistency standards. It’s designed so that anyone in the community can extend it, adding new datasets, formats, or materials.
When researchers fine-tuned existing AI models on LeMat-Traj, they saw measurable improvements: 36% lower error in predicting atomic forces and a 10% boost in accuracy on an international benchmark for stability prediction. In short: better data led to smarter models [2].
LeMat-Synth: Extracting synthesis recipes
LeMat-Synth focuses on making material synthesis procedures more accessible. It introduces a multi-modal framework using large language models (LLMs) and vision-language models (VLMs) to extract synthesis procedures from scientific literature.
By analyzing over 80,000 open-access papers, LeMat-Synth builds one of the first large-scale datasets of material synthesis recipes, covering 35 synthesis methods and 16 material classes. It also provides an open extraction pipeline that researchers can apply to new papers as the literature grows, enabling deeper study of how synthesis conditions relate to material properties. This work represents an important step towards connecting computational predictions with real-world experimental practices [3].
LeMat-GenBench: Generative AI Benchmark
If LeMat-Bulk was about unifying data, LeMat-Traj about capturing materials behavior, and LeMat-Synth about material synthesis recipes, then LeMat-GenBench provides a unified benchmark for generative AI models of crystalline structures. By providing a common evaluation framework, metric implementations, and an interactive leaderboard on Hugging Face, it enables researchers to assess whether generative models can propose materials that are novel, unique, stable, valid, and diverse.
In doing so, LeMat-GenBench fosters comparable and transparent evaluation across the community. It also helps move the field from “training models” to “trusting models”, a critical step in translating AI discoveries into real materials [4].
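As a rough illustration of two of the criteria named above, the sketch below computes uniqueness and novelty rates from structure fingerprints. The definitions are common conventions assumed for this example and are not necessarily the ones LeMat-GenBench implements; the function names and the toy fingerprints are hypothetical.

```python
def uniqueness_rate(generated):
    """Fraction of generated structures that are distinct from one another."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty_rate(generated, reference):
    """Fraction of generated structures absent from a reference set."""
    if not generated:
        return 0.0
    known = set(reference)
    return sum(fp not in known for fp in generated) / len(generated)


# Toy fingerprints: "a" is generated twice and already exists in the reference set.
generated = ["a", "a", "b", "c"]
reference = ["a", "d"]

uniqueness = uniqueness_rate(generated)      # 3 distinct out of 4
novelty = novelty_rate(generated, reference)  # "b" and "c" are new
```

Stability, validity, and diversity need physical calculations or structural checks and are not covered by this sketch.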
What is coming next?
Looking ahead to 2026, the LeMaterial roadmap continues to grow, with a set of new initiatives pushed by the community:
- The LeMaterial Reading Group has been launched, creating a space for researchers to discuss open-science challenges together.
- LeMat-Rho, a dataset focused on charge densities and electronic properties, is currently in preparation.
- The Fetcher pipeline will soon support experimental and molecular-dynamics data, allowing the community to connect quantum calculations with real-world measurements.
- A project call will be released soon, inviting researchers to collaborate within LeMaterial on key research directions.
- The Entalpic Research Fellowship is a program that supports PhD students and postdoctoral researchers in contributing to open-science projects within LeMaterial.
A year that redefined collaboration
LeMaterial began as an effort to connect existing materials datasets. Over the past year, it has grown into a collaborative space where researchers can work with shared standards and tools.
The progress made so far reflects the contribution of many across the community. As LeMaterial enters its second year, the focus remains the same: making it easier for researchers to access data, test models, and build on one another’s work.
The road is still long, but together, it is possible!
→ Explore LeMaterial on Hugging Face and join the community on Slack to contribute to open materials discovery.
Learn more
If you are eager to go deeper, explore these sources:
- [1] Siron, M., Djafar, I., du Fayet, E., Rossello, A., Ramlaoui, A., & Duval, A. “LeMat-Bulk: aggregating, and de-duplicating quantum chemistry materials databases.” OpenReview, AI4Mat Workshop (2025). https://openreview.net/forum?id=w0AsJpgwKq
- [2] Ramlaoui, A., Siron, M., Djafar, I., Musielewicz, J., Rossello, A., Schmidt, V., & Duval, A. “LeMat-Traj: A Scalable and Unified Dataset of Materials Trajectories for Atomistic Modeling.” arXiv preprint arXiv:2508.20875 (2025). https://arxiv.org/abs/2508.20875
- [3] Lederbauer, M., Betala, S., Li, X., Jain, A., Sehaba, A., Channing, G., Germain, G., Leonescu, A., Flaifil, F., Amayuelas, A., Nozadze, A., Schmid, S. P., Zaki, M., Ethirajan, S. K., Pan, E., Franckel, M., Duval, A., Krishnan, N. M. A., & Gleason, S. P. “LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature.” arXiv preprint arXiv:2510.26824 (2025). https://www.arxiv.org/abs/2510.26824
- [4] Duval, A., Betala, S., Gleason, S. P., Xu, A., Channing, G., Levy, D., Ramlaoui, A., Fourrier, C., Joshi, C. K., Kazeev, N., Kaba, S.-O., Therrien, F., Hernández-García, A., Mercado, R., & Krishnan, N. M. A. “LeMat-GenBench: Bridging the gap between crystal generation and materials discovery.” OpenReview, AI4Mat Workshop (2025). https://openreview.net/forum?id=ZfPGcTfDWn
LeMaterial is developed in the spirit of Open Science. All resources are released under permissive licenses (CC-BY-4.0), supported by Entalpic and Hugging Face, and governed by open working groups to ensure inclusivity and community collaboration.
LeMaterial stands on the shoulders of giants. We build upon incredible projects that have been instrumental in the development of this initiative: OPTIMADE, Materials Project, Alexandria, and OQMD, with more to come. Please credit them accordingly when using LeMaterial.
If you use LeMaterial as a resource in your research, please cite it using the citation section of our data card.
CC-BY-4.0 (the license used for Materials Project, Alexandria, and OQMD) requires proper acknowledgement. Thus, if you use materials data whose immutable_id includes “mp-”, please cite the Materials Project. If the immutable_id includes “agm-”, please cite Alexandria (PBE) or Alexandria (PBEsol, SCAN). If the immutable_id includes “oqmd-”, please cite OQMD. Finally, if you make use of the Phase Diagram for visualization purposes, or the crystal viewer in the Materials Explorer, please acknowledge Crystal Toolkit.
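The attribution rule above can be checked with a few lines of code. This is a minimal sketch: the helper name `sources_to_cite` and the example identifiers are hypothetical, and it assumes the marker appears at the start of each immutable_id.

```python
# Prefix-to-source mapping taken from the acknowledgement rule above.
CITATION_BY_PREFIX = {
    "mp-": "Materials Project",
    "agm-": "Alexandria",
    "oqmd-": "OQMD",
}


def sources_to_cite(immutable_ids):
    """Return the sorted list of upstream databases to acknowledge."""
    found = set()
    for identifier in immutable_ids:
        for prefix, source in CITATION_BY_PREFIX.items():
            # Assumption: the marker is a prefix of the identifier.
            if identifier.startswith(prefix):
                found.add(source)
    return sorted(found)


# Made-up identifiers; "xyz-4" matches no known source and is ignored.
to_cite = sources_to_cite(["mp-1", "oqmd-2", "mp-3", "xyz-4"])
```

For Alexandria entries, the choice between the PBE and PBEsol/SCAN references still depends on which data you used, so the result is a reminder and does not replace a full citation list.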
Stay tuned; we’ll continue sharing more about how AI is reshaping materials discovery and how your industry can lead this change.
Written by Lya Campos Rivera, science communicator & materials chemist.
Contact the Entalpic team at contact@entalpic.ai