|𝔻⟩irac's Student: February 2024

Came across a very useful package that makes it easy to load atomic datasets containing coordinates, energies, and forces for use in fitting interatomic potentials. It's called Load Atoms; it has super simple syntax and seems to be very fast. It also caches the datasets so you don't have to re-download them every time. It's really easy to get started:

pip install load-atoms

from load_atoms import load_dataset
dataset = load_dataset("QM9")

This will display an info card:

╭───────────────────────────────── QM9 ─────────────────────────────────╮
│                                                                       │
│   Downloading dsgdb9nsd.xyz.tar.bz2 ━━━━━━━━━━━━━━━━━━━━ 100% 00:09   │
│   Extracting dsgdb9nsd.xyz.tar.bz2  ━━━━━━━━━━━━━━━━━━━━ 100% 00:18   │
│   Processing files                  ━━━━━━━━━━━━━━━━━━━━ 100% 00:19   │
│   Caching to disk                   ━━━━━━━━━━━━━━━━━━━━ 100% 00:02   │
│                                                                       │
│            The QM9 dataset is covered by the CC0 license.             │
│        Please cite the QM9 dataset if you use it in your work.        │
│          For more information about the QM9 dataset, visit:           │
│                            load-atoms/QM9                             │
╰───────────────────────────────────────────────────────────────────────╯

You can then begin using the dataset for any machine learning work. There are also utility methods to prepare/split the dataset. Overall, just a very convenient tool that I hope grows in the number of available datasets.
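To make the splitting step concrete, here is a minimal, library-independent sketch of a seeded train/validation/test split. This is not the Load Atoms API (the package has its own helpers, and I don't want to guess at their signatures); `split_indices` and its parameters are names I made up for illustration.

```python
import random


def split_indices(n_items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle range(n_items) and cut it into consecutive chunks sized by `fractions`.

    Hypothetical helper for illustration only; not part of load-atoms.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_items))
    # A dedicated, seeded RNG makes the split reproducible
    random.Random(seed).shuffle(indices)
    splits, start = [], 0
    for i, frac in enumerate(fractions):
        # The last chunk takes whatever is left, so rounding never drops an item
        is_last = i == len(fractions) - 1
        stop = n_items if is_last else start + int(frac * n_items)
        splits.append(indices[start:stop])
        start = stop
    return splits


# 133,885 is the number of QM9 structures listed in the table in this post
train_idx, val_idx, test_idx = split_indices(133_885)
```

Assuming the loaded dataset can be indexed by integer like a list, something like `[dataset[i] for i in train_idx]` would then give the training structures. In practice I would reach for the package's own split utilities first and keep this only as a fallback.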

Side Treat: Visualization

The other thing I really like is the visualization utility that is built into the package. It uses X3DOM via the ASE io HTML write format, which I didn't even know existed! You can use this option to display your structures in a Jupyter notebook. The Load Atoms viewer produces X3DOM HTML visualizations that look a little better than the default ASE output. Here is a code snippet that produces such a view with Load Atoms:

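from ase.build import graphene_nanoribbon
from load_atoms import view

# Build a hydrogen-saturated armchair graphene nanoribbon with 3.5 Å of vacuum
nanoribbon = graphene_nanoribbon(3, 4,
                                 type='armchair',
                                 saturated=True,
                                 vacuum=3.5)

# Rotate the atoms and the cell by 90° about the z-axis
nanoribbon.rotate(v='z', a=90, rotate_cell=True)

# Render the interactive view with bonds drawn
view(nanoribbon, show_bonds=True)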

Available Datasets

It has most of the datasets that have been used to test different ML and NN models in chemistry, e.g., QM9, but it's missing large datasets like the Materials Project or AFLOW. My guess is it will get there, though there might be license or size issues (🤷‍♂️). Here is the list of what is available:

Dataset	Elements	# Atoms	# Structures	License	Year
AC-2D-22	C	30,000	150	CC BY-NC-SA 4.0	2022
C-GAP-17	C	284,965	4,530	CC BY-NC-SA 4.0	2017
C-GAP-20U	C	400,275	6,088	GPLv3	2020
C-SYNTH-23M	C	23,041,200	115,206	MIT	2022
GST-GAP-22	Ge, Sb, Te	341,132	2,692	CC BY 4.0	2022
P-GAP-20	P	140,910	4,798	CC BY 4.0	2020
QM7	H, C, N, O, S	110,650	7,165	None	2012
QM9	H, C, N, O, F	2,407,753	133,885	CC0	2014
Si-GAP-18	Si	171,815	2,475	CC BY-NC-SA 4.0	2018
SiO2-GAP-22	O, Si	268,118	3,074	CC BY 4.0	2022
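A quick way to read the table is to divide atoms by structures, which tells you whether a dataset holds small molecules or larger cells. The sketch below does that arithmetic for a few rows copied from the table above; the dictionary layout and variable names are my own.

```python
# (total atoms, number of structures) copied from the table above
datasets = {
    "AC-2D-22": (30_000, 150),
    "C-GAP-17": (284_965, 4_530),
    "C-SYNTH-23M": (23_041_200, 115_206),
    "QM7": (110_650, 7_165),
    "QM9": (2_407_753, 133_885),
}

# Average number of atoms per structure for each dataset
atoms_per_structure = {
    name: n_atoms / n_structures
    for name, (n_atoms, n_structures) in datasets.items()
}

# Print from the largest average structure to the smallest
for name, avg in sorted(atoms_per_structure.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {avg:8.1f} atoms/structure")
```

The molecular sets (QM7, QM9) come out at fewer than 20 atoms per structure, while AC-2D-22 and C-SYNTH-23M work out to exactly 200 atoms per structure.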