
Tuesday, February 27, 2024

Atomic Dataset Convenience Tool

Came across this very useful package that makes it easy to load atomic datasets containing coordinates, energies, and forces for use in fitting interatomic potentials. It's called Load Atoms; it has super simple syntax and seems to be very fast. It also caches the datasets so you don't re-download them every time. It's really easy to get started.

pip install load-atoms

from load_atoms import load_dataset
dataset = load_dataset("QM9")

This will display an info card:

╭───────────────────────────────── QM9 ─────────────────────────────────╮
│                                                                       │
│ Downloading dsgdb9nsd.xyz.tar.bz2 ━━━━━━━━━━━━━━━━━━━━ 100% 00:09     │
│ Extracting dsgdb9nsd.xyz.tar.bz2  ━━━━━━━━━━━━━━━━━━━━ 100% 00:18     │
│ Processing files                  ━━━━━━━━━━━━━━━━━━━━ 100% 00:19     │
│ Caching to disk                   ━━━━━━━━━━━━━━━━━━━━ 100% 00:02     │
│                                                                       │
│ The QM9 dataset is covered by the CC0 license.                        │
│ Please cite the QM9 dataset if you use it in your work.               │
│ For more information about the QM9 dataset, visit:                    │
│ load-atoms/QM9                                                        │
╰───────────────────────────────────────────────────────────────────────╯

You can then begin using the dataset for any machine learning work. There are also utility methods to prepare/split the dataset. Overall, it's just a very convenient tool, and I hope the number of available datasets keeps growing.
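To make the "split" step concrete, here is a minimal sketch of what a reproducible random train/validation/test split does. This is not the Load Atoms API: the `random_split` helper, the 80/10/10 fractions, and the placeholder strings standing in for real structures are all assumptions made for illustration, using only the Python standard library.

```python
import random


def random_split(items, fractions, seed=42):
    """Shuffle items reproducibly, then cut them into consecutive chunks.

    Hypothetical helper for illustration only; not part of Load Atoms.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    # Shuffle indices with a private, seeded generator so the split is repeatable.
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)

    splits, start = [], 0
    for i, frac in enumerate(fractions):
        # The last chunk takes the remainder so rounding never drops a structure.
        is_last = i == len(fractions) - 1
        stop = len(items) if is_last else start + int(frac * len(items))
        splits.append([items[j] for j in indices[start:stop]])
        start = stop
    return splits


# Placeholder "structures"; in practice these would be the loaded structures.
structures = [f"structure_{i}" for i in range(100)]
train, val, test_set = random_split(structures, [0.8, 0.1, 0.1])
print(len(train), len(val), len(test_set))
```

Fixing the seed matters when comparing models: every model then sees exactly the same held-out structures.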

Side Treat: Visualization

The other thing I really like is the visualization utility that is built into the package. It uses X3DOM via the HTML write format in ASE's io module, which I didn't even know existed! You can use this option to display your structures in a Jupyter notebook. The Load Atoms viewer produces X3DOM HTML visualizations that look a little better than the default ASE output. Here is a code snippet and the resulting view from Load Atoms:

from ase.build import graphene_nanoribbon
from load_atoms import view
nanoribbon = graphene_nanoribbon(3, 4, type='armchair', saturated=True, vacuum=3.5)
nanoribbon.rotate(v='z', a=90, rotate_cell=True)
view(nanoribbon, show_bonds=True)

Available Datasets

It has most of the datasets that have been used to test different ML and NN models in chemistry, e.g., QM9, but it's missing large datasets like the Materials Project or AFLOW. My guess is it will get there, but there might be license or size issues (🤷‍♂️). Here is the list of what is available:

Dataset      Elements       # Atoms     # Structures  License          Year
AC-2D-22     C              30,000      150           CC BY-NC-SA 4.0  2022
C-GAP-17     C              284,965     4,530         CC BY-NC-SA 4.0  2017
C-GAP-20U    C              400,275     6,088         GPLv3            2020
C-SYNTH-23M  C              23,041,200  115,206       MIT              2022
GST-GAP-22   Ge, Sb, Te     341,132     2,692         CC BY 4.0        2022
P-GAP-20     P              140,910     4,798         CC BY 4.0        2020
QM7          H, C, N, O, S  110,650     7,165         None             2012
QM9          H, C, N, O, F  2,407,753   133,885       CC0              2014
Si-GAP-18    Si             171,815     2,475         CC BY-NC-SA 4.0  2018
SiO2-GAP-22  O, Si          268,118     3,074         CC BY 4.0        2022

As you can see, these are fairly limited datasets in terms of the species and structures they include. It's a good start though.

