
Thursday, August 8, 2024

Atomistics with Containers and MACE

The world of classical atomistic simulations has been moving full steam ahead with the development of universal ML potentials, more specifically Graph Neural Networks1. This advance in new capabilities makes running multi-species and multi-phase calculations and simulations much more practical. There are many different variations out there, and Matbench Discovery[1], which I've written about, does a very nice job of capturing the general predictive capabilities of the different models. In this post, I'm covering the MACE2 model[2], which I also explored in a draft preprint I was putting together in my spare time on phonon and elastic predictions using GNNs for shape memory alloys.

Be attentive

Although these GNN models are quite capable and seem to be generally very good, caution is still needed when applying them to specific materials and setups. Just make sure you understand that you should take a qualitative approach rather than a quantitative one, unless you have experimental data to compare against.

What I wanted to cover today is not a specific use or deployment on materials, but rather that I finally figured out how to set up MACE with LAMMPS for CPU inference. I don't have a GPU on my personal machine, so I'll need to use TensorDock to get that working. Rather than configuring everything directly in my native machine environment (i.e., Ubuntu 22.04 LTS), I took an alternative approach to avoid cluttering my system tree. The tool I used is Apptainer3, which is an HPC containerization tool that can leverage Docker images.

Apptainer

You'll need to set this up to get started. At first, it looked daunting because compiling from source requires specifying a fair number of repositories, and those instructions can easily become outdated depending on when the documentation was last updated. Fortunately, if you head over to the GitHub repository's releases page, you can find Linux package files for different architectures (e.g., .deb or .rpm) and install these seamlessly.

Apptainer is a nifty tool because it naturally captures the terminal workflow you would normally follow when setting up a computing environment. You specify the basic Linux commands for setting up and installing things, which is done in the %post section. By default, this is all done with fakeroot settings, so the directory tree will be /root, which can be awkward, but just remember that (maybe there is a way to change this!). The rest of the sections usually specify the runtime settings. You also need to specify the base Linux image to use and the containerization technology to bootstrap from; I've only ever used Docker images.

I'm still very new to Apptainer/Singularity, so I'm not sure if the definition files I've been putting together are best practices, but I've had success and I really like the isolation and portability they seem to offer.

MACE + LAMMPS

One thing about many of the GNN potentials is that they have mostly been configured to work with ASE using custom ASE Calculator class objects. This is super convenient given that almost all are developed in Python with PyTorch and use the ASE Atoms class or pymatgen Structure class, the latter of which can easily be converted to ASE Atoms. I like ASE, and doing basic dynamic simulations (i.e., molecular dynamics) is possible. The issues that usually arise are performance and features. This is where LAMMPS usually shines; it's both performant and very feature-rich.
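For example, a minimal sketch of that ASE workflow in Python looks something like the following. This assumes the mace_mp convenience constructor available in recent mace-torch releases and uses an approximate B2 NiTi cell as an illustration; adjust for your installed version and material:

from ase import Atoms
from mace.calculators import mace_mp

# Approximate B2 (CsCl-type) NiTi cell, lattice constant ~3.0 Angstrom.
atoms = Atoms("NiTi",
              scaled_positions=[[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]],
              cell=[3.0, 3.0, 3.0],
              pbc=True)

# Attach a pretrained MACE foundation model as an ASE calculator (CPU inference).
atoms.calc = mace_mp(model="small", device="cpu", default_dtype="float64")

print("Energy (eV):", atoms.get_potential_energy())
print("Forces (eV/A):", atoms.get_forces())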

So, is it possible to combine LAMMPS with these GNN potentials? The short answer is yes, but the complete answer is that it's not very robust or well-tested. I've found that in many of these GNN model frameworks, including MACE, things break, usually from deep within the PyTorch stack, making it hard to troubleshoot. But when you do get it to work, it does work. The issue, though, is matching up computing resource settings so there are no bottlenecks; this I have yet to really figure out. With MACE, MPI does not seem to be usable from within LAMMPS, while threading (OMP) does seem to work, though the performance boost appears to be minimal. The other noticeable aspect is that MACE is memory-hungry, using up 30 GB for a 1000-atom system. πŸ˜…

Okay, so how did I get it to work with Apptainer? Here is the definition file:

BootStrap: docker
From: ubuntu:22.04

%labels
    Author Stefan Bringuier
    Email stefanbringuier@gmail.com
    Version 1.1
    Description "MACE+LAMMPS (CPU version only)."

%post
    # Install system dependencies
    apt update && DEBIAN_FRONTEND=noninteractive apt install -y \
        python3 \
        python3-pip \
        python3-dev \
        git \
        cmake \
        g++ \
        libopenmpi-dev \
        wget \
        zip \
        ffmpeg \
        && apt clean \
        && rm -rf /var/lib/apt/lists/* \
            /usr/share/doc/* \
            /usr/share/man/* \
            /usr/share/locale/*

    # Install ASE and MACE
    pip3 install --upgrade pip
    pip3 install ase
    pip3 install torch==2.2.2 \
        torchvision==0.17.2 \
        torchaudio==2.2.2 \
        --index-url https://download.pytorch.org/whl/cpu
    pip3 install mace-torch

    # MKL needed by LAMMPS
    wget \
        https://registrationcenter-download.intel.com/akdlm/IRC_NAS/cdff21a5-6ac7-4b41-a7ec-351b5f9ce8fd/l_onemkl_p_2024.2.0.664.sh
    sh ./l_onemkl_p_2024.2.0.664.sh -a -s --eula accept

    # Download and install the CPU version of PyTorch (libtorch) and dependencies
    wget \
        https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-2.2.2%2Bcpu.zip
    unzip libtorch-shared-with-deps-2.2.2+cpu.zip
    rm libtorch-shared-with-deps-2.2.2+cpu.zip
    mv libtorch $HOME/libtorch-cpu

    # FIX: Add the library path to the dynamic linker configuration
    echo "$HOME/libtorch-cpu/lib" > /etc/ld.so.conf.d/libtorch.conf
    ldconfig

    # Set up LAMMPS with MACE support (CPU version)
    git clone --branch=mace --depth=1 https://github.com/ACEsuit/lammps
    cd lammps
    mkdir build
    cd build
    cmake \
        -D CMAKE_BUILD_TYPE=Release \
        -D CMAKE_INSTALL_PREFIX=$HOME/lammps \
        -D CMAKE_CXX_STANDARD=17 \
        -D CMAKE_CXX_STANDARD_REQUIRED=ON \
        -D BUILD_MPI=ON \
        -D BUILD_OMP=ON \
        -D PKG_OPENMP=ON \
        -D PKG_ML-MACE=ON \
        -D CMAKE_PREFIX_PATH=$HOME/libtorch-cpu \
        -D MKL_INCLUDE_DIR=/opt/intel/oneapi/mkl/latest/include \
        -D MKL_LIBRARY=/opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_intel_lp64.so \
        ../cmake
    make -j 4
    make install

    # Create symbolic link to lammps and clean up pip cache
    #ln -s $HOME/lammps/bin/lmp /usr/bin/lmp
    pip3 cache purge

%environment
    export LC_ALL=C
    export PATH=/root/lammps/bin:$PATH
    export LD_LIBRARY_PATH=$HOME/libtorch-cpu/lib:$LD_LIBRARY_PATH
    export CUDA_VISIBLE_DEVICES=""
    # Default MKL/OMP Threads
    export OMP_NUM_THREADS=4
    export MKL_NUM_THREADS=4

%runscript
    echo "Starting MACE python environment"
    python3 "$@"

%startscript
    echo "Shell with MACE and LAMMPS."
    exec /bin/bash "$@"

%help
    Apptainer with MACE and LAMMPS with MACE support (CPU version).
    Usage:
      - To run a Python script with MACE:
          apptainer run MACE_CPU.sif your_script.py
      - To start an interactive bash session:
          apptainer shell MACE_CPU.sif

    To build your container image, it's simple:
      apptainer build MACE_CPU.sif MACE_CPU.def

Then you'll get the .sif file, and you can run it in three different ways:

  1. runscript, which runs a specific command and arguments. In this def file, it's Python: apptainer run MACE_CPU.sif
  2. startscript, which will create an interactive shell for the container: apptainer shell MACE_CPU.sif
  3. The last is specifying a command and arguments to execute: apptainer exec MACE_CPU.sif python3 -c "import mace"

The definition file above does quite a few things in the %post section. It updates the system image, installs a bunch of libraries and tools, then installs PyTorch and MACE, and finally builds a custom LAMMPS. This can take a considerable amount of time, and one frustrating thing I cannot figure out is how to create a temporary build so that if it fails, I don't have to start from scratch (i.e., download everything again). It would also be cool if the build could use the local host system's cache to speed things up.

What's the selling point?

For me, that's easy: a way to nicely sandbox builds so that you don't mess your local system up. Furthermore, in theory, you get a containerized system image that is super portable and can be run on almost any machine. For CPU-based execution, I've pretty much found this to be true.

Now, with regard to MACE, there was a lot of tinkering with PyTorch to finally get it working with Python in the container. The same was the case with LAMMPS. Both are working; however, MACE evaluations are pretty slow. Here are the timings for a 250-atom system run for 5000 steps with 1 MPI process and 4 OMP threads:

Performance: 0.092 ns/day, 261.944 hours/ns, 0.424 timesteps/s, 106.045 atom-step/s
127.5% CPU use with 1 MPI task x 4 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 11786      | 11786      | 11786      |   0.0 | 99.99
Neigh   | 0          | 0          | 0          |   0.0 |  0.00
Comm    | 0.12385    | 0.12385    | 0.12385    |   0.0 |  0.00
Output  | 0.0012606  | 0.0012606  | 0.0012606  |   0.0 |  0.00
Modify  | 1.2743     | 1.2743     | 1.2743     |   0.0 |  0.01
Other   |            | 0.009874   |            |       |  0.00

All the time is in the MACE evaluation part, and as you can see, it is very time-consuming! My guess is that getting better performance will require a lot of trial-and-error with run settings or even different compile settings. I should also mention that the MACE developers clearly state that the LAMMPS interface is still in development and that GPU inference is a much better option.

Example LAMMPS script

One of my intended uses of MACE with LAMMPS is for my NiTi shape memory alloy preprint draft. The use will be to get the temperature-dependent phonon dispersion and DOS using the fluctuation-dissipation approach implemented in the fix phonon command. In addition, we can use the phase order parameter to track the percent phase transformation as a function of temperature or strain [3].

Since these universal GNN potentials support elements up to at least atomic number ~70, there is no real limitation on species types. Here is an example LAMMPS script for the NiTi B2 structure:

# LAMMPS NiTi with MACE
units         metal
atom_style    atomic
atom_modify   map yes
newton        on

read_data     data.NiTi_B2

# MACE potential
pair_style    mace no_domain_decomposition
pair_coeff    * * mace_agnesi_small.model-lammps.pt Ni Ti

fix           1 all npt temp 600.0 600.0 1.0 iso 1.0 1.0 10.0

timestep      0.001
thermo        100
thermo_style  custom step temp pe etotal press vol

run           5000

There are some important notes when using MACE with LAMMPS. The first is that atom_modify map yes needs to be set. The second is that, when not using many MPI processes, you should use the no_domain_decomposition option of the pair_style command. The model file mace_agnesi_small.model-lammps.pt is converted using this utility file from this model file. When you run the utility script on the pre-trained model file, you'll get the LAMMPS version.
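As a quick, optional sanity check that the conversion produced something usable, you can try loading the converted file with plain PyTorch inside the container. This assumes the converted file is a TorchScript archive, which is what the MACE-to-LAMMPS conversion appears to produce; the path below is just a placeholder for your own file:

import torch

# Hypothetical path to the converted model; replace with your own file.
model_path = "mace_agnesi_small.model-lammps.pt"

# torch.jit.load will fail loudly if the file is not a valid TorchScript archive.
model = torch.jit.load(model_path, map_location="cpu")
print(type(model))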

Till Next Time

For now, this is where I'm at: I've got an Apptainer definition file that works for CPUs and runs MACE on its own and with LAMMPS. The goal is to get it running on GPUs through Apptainer on TensorDock; then I'll get to test how much speed-up can be had. It's also worth mentioning that there is a JAX implementation of MACE, which is reported to be about 2x faster than the PyTorch version. I'm curious whether there is a C++ library interface for JAX that could be used with LAMMPS, similar to how it was done for the PyTorch version.

Footnotes


  1. For more info on GNNs and my efforts, see these posts

  2. I don't really know what the acronym stands for, but the title of the original paper is MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields

  3. I believe Apptainer is the same as Singularity, just a renaming of the open-source version of the product 🀷‍♂️. 


References

[1] J. Riebesell, R.E.A. Goodall, P. Benner, Y. Chiang, B. Deng, A.A. Lee, A. Jain, K.A. Persson, Matbench Discovery -- A framework to evaluate machine learning crystal stability predictions, (2023). DOI.

[2] I. Batatia, D.P. Kovacs, G.N.C. Simm, C. Ortner, G. Csanyi, MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields, in: A.H. Oh, A. Agarwal, D. Belgrave, K. Cho (Eds.), Advances in Neural Information Processing Systems, 2022. URL.

[3] S. Gur, et al., Evolution of Internal Strain in Austenite Phase during Thermally Induced Martensitic Phase Transformation in NiTi Shape Memory Alloys, Computational Materials Science 133 (2017) 52–59. DOI.




Thursday, August 1, 2024

Why not just use a CPU

Modern science and engineering use computing for a vast array of tasks. Some are basic, such as text or image processing, while others are more specific to solving difficult math and physics equations using numerical methods1. For much of computing history, the central processing unit (CPU) has been the most important component for handling the arithmetic, logic, control, and input/output (I/O) operations used to solve computational problems. The reason is that the CPU is a very general-purpose device and is very flexible in use. However, this comes with a downside: operations that might be simple to execute could actually be quite time-consuming on a CPU.

So what is there to compare to; in other words, what other type of computing device could be used? The answer is a processing device that is tailored for a specific type of operation. One example is the application-specific integrated circuit (ASIC), which is a gate array of transistors that can be programmed to behave in a manner optimized for a specific application or task. Another is the graphics processing unit (GPU), which is what I want to talk about today.

Ancient πŸ—ž️

I'm writing this post because I wanted to capture the difference between GPUs and CPUs, but this post is probably 10 years too late; in other words, many have written posts like this before.

The GPU vs. CPU

The fundamental difference between GPUs and CPUs is in their architecture and design philosophy [1-2]. The CPU is considered "general purpose" and is optimized for sequential processing. CPUs typically have a few highly sophisticated physical cores (4-16 in consumer models), each capable of handling a wide variety of operations efficiently.

The GPU, on the other hand, was originally designed for rendering computer graphics, which involves performing many similar calculations simultaneously [1-2]. This led to an architecture with hundreds or thousands of simpler cores, each capable of executing the same instruction on different data points simultaneously. This approach, known as Single Instruction, Multiple Data (SIMD), makes GPUs extraordinarily efficient at parallel processing [1].
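To make the data-parallel idea a bit more concrete, here is a loose analogy in plain Python/NumPy. NumPy vectorization is not GPU SIMD, but it illustrates applying the same operation to many data elements at once instead of looping over them one at a time:

import numpy as np

x = np.random.rand(1_000_000)

# Scalar-style loop: one element at a time (conceptually how a single CPU thread works).
y_loop = np.empty_like(x)
for i in range(x.size):
    y_loop[i] = 4.0 * x[i] * (1.0 - x[i])

# Data-parallel style: the same instruction applied across the whole array at once.
y_vec = 4.0 * x * (1.0 - x)

assert np.allclose(y_loop, y_vec)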

Why can GPUs be better for Scientific Computing?

We can go through various points to understand why a GPU is better for many scientific2 computing applications:

  1. Parallel Processing: Many scientific algorithms, such as matrix operations, Monte Carlo simulations, and numerical integrations, can be parallelized. GPUs excel at these tasks, potentially offering orders of magnitude speedup compared to CPUs.

  2. Memory Bandwidth: GPUs typically have much higher memory bandwidth than CPUs. This is crucial for data-intensive computations where the bottleneck is often moving data rather than performing calculations.

  3. Floating-Point Performance: Modern GPUs are optimized for floating-point arithmetic, which is essential in many scientific computations. They can perform these operations much faster than CPUs, especially when dealing with single-precision numbers.

  4. Energy Efficiency: For the same computational power, GPUs often consume less energy than CPUs, which can be a significant factor in large-scale scientific computing environments.

  5. Cost-Effectiveness: In terms of raw computational power per dollar, GPUs often outperform CPUs, making them an attractive option for budget-constrained research projects.

When is a GPU not better?

Because GPUs are specialized computing devices, they have some downsides:

  1. Sequential Tasks: Algorithms that can't be easily parallelized often perform better on CPUs due to their higher clock speeds and more sophisticated cores.

  2. Complex Control Flow: Tasks involving many conditional statements or complex branching logic are generally better suited to CPUs.

  3. Small Data: The overhead of transferring data to and from the GPU can outweigh the benefits for small datasets.

  4. I/O-Bound Tasks: For operations that are limited by input/output speeds rather than computational power, the GPU's advantages may be negated.

Some challenges with GPU programming

For at least 10 years I've tried to get reasonably familiar with GPU programming, mostly with CUDA, which is Nvidia's GPU programming language. There are other GPU programming languages, such as OpenCL and HIP, that are similar to CUDA; the issue is that they are less adopted and seem more difficult. But in general I have found GPU programming to be a tour de force; that is, it is always more difficult than I anticipate. Here are some reasons:

  1. Different Programming Model: GPU programming requires thinking in terms of parallel execution, which can be counterintuitive or unnatural.

  2. Memory Management: GPU memory is limited, and efficient GPU programming requires careful management of memory types and of data transfer between the CPU and GPU (see the sketch after this list).

  3. Debugging Tools: I've never been good with debugging tools, but GPU code is much harder to debug.

  4. Optimization: Achieving peak performance usually requires understanding specific GPU architectures.
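To illustrate the memory management point, here is a minimal sketch of explicit host-to-device and device-to-host transfers using PyCUDA's driver API. This assumes PyCUDA and a CUDA-capable GPU are available; the array sizes and names are just placeholders:

import numpy as np
import pycuda.autoinit          # Creates a CUDA context on the default device.
import pycuda.driver as drv

# Host (CPU) array.
a_host = np.random.rand(1024).astype(np.float32)

# Allocate device (GPU) memory and copy the data over explicitly.
a_dev = drv.mem_alloc(a_host.nbytes)
drv.memcpy_htod(a_dev, a_host)

# ... a kernel would operate on a_dev here ...

# Copy results back to the host; every transfer costs time and must be managed.
result_host = np.empty_like(a_host)
drv.memcpy_dtoh(result_host, a_dev)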

Example PyCUDA code

To illustrate GPU programming in a context relevant to materials science, let's consider a simple example of calculating the Lennard-Jones potential, which is commonly used to model interatomic interactions in molecular dynamics simulations. Here's a CUDA implementation using Python and PyCUDA [3]:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# GPU implementation
cuda_module = SourceModule("""
__global__ void lennard_jones_gpu(float *potential, float *positions,
                                  int num_atoms, float epsilon, float sigma)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < num_atoms) {
        float pote = 0.0f;
        float3 pos_i = make_float3(positions[3*i], positions[3*i+1], positions[3*i+2]);

        float sigma6 = powf(sigma, 6);
        float sigma12 = sigma6 * sigma6;

        for (int j = 0; j < num_atoms; j++) {
            if (i != j) {
                float3 pos_j = make_float3(positions[3*j], positions[3*j+1], positions[3*j+2]);
                float3 r = make_float3(pos_i.x - pos_j.x,
                                       pos_i.y - pos_j.y,
                                       pos_i.z - pos_j.z);
                float r2 = r.x*r.x + r.y*r.y + r.z*r.z;
                float r6 = r2*r2*r2;
                float r12 = r6*r6;

                pote += 4.0f * epsilon * ((sigma12/r12) - (sigma6/r6));
            }
        }
        potential[i] = pote;
    }
}
""")

lennard_jones_gpu = cuda_module.get_function("lennard_jones_gpu")

# Parameters
num_atoms = 1000
epsilon = 1.0
sigma = 1.0

# Arrays
positions = np.random.rand(num_atoms * 3).astype(np.float32)
potential = np.zeros(num_atoms).astype(np.float32)

# Call the kernel
lennard_jones_gpu(
    drv.Out(potential),
    drv.In(positions),
    np.int32(num_atoms),
    np.float32(epsilon),
    np.float32(sigma),
    block=(256, 1, 1),
    grid=((num_atoms + 255) // 256, 1),
)

The keyword arguments block and grid tell the kernel launch how to perform the parallelization (i.e., how to decompose the work across threads and thread blocks). This example demonstrates how we can use GPU computing to calculate the Lennard-Jones potential for a system of atoms, which is a fundamental operation in many molecular dynamics simulations. The GPU's parallel processing capabilities allow us to compute the potential for many atoms simultaneously, which can lead to significant speedups compared to CPU-based calculations, especially for larger systems.
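As a small aside, the launch-configuration arithmetic can be wrapped in a tiny helper; this is just an illustration of the same ceiling division used in the kernel call above, with one GPU thread per atom:

# Mirrors block=(256, 1, 1) and grid=((num_atoms + 255) // 256, 1) from the call above.
def launch_config(num_atoms, threads_per_block=256):
    blocks = (num_atoms + threads_per_block - 1) // threads_per_block  # Ceiling division.
    return (threads_per_block, 1, 1), (blocks, 1)

block, grid = launch_config(1000)
print(block, grid)  # -> (256, 1, 1) (4, 1)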

Benchmarking CPU vs GPU: Lennard-Jones Potential Calculation

To truly appreciate the performance gains of using GPUs for computational materials science, I put together a Jupyter notebook that compares the execution time of Lennard-Jones potential calculations on both a CPU and a GPU. I think it helps demonstrate how much speed-up can be had with GPU hardware for the right problem.
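For reference, the CPU baseline in a comparison like this can be as simple as the following NumPy implementation. This is a sketch of the kind of baseline used, not the exact notebook code; it evaluates the same double-sum Lennard-Jones expression as the GPU kernel above:

import numpy as np

def lennard_jones_cpu(positions, epsilon=1.0, sigma=1.0):
    # Per-atom Lennard-Jones potential, same double sum as the GPU kernel.
    pos = positions.reshape(-1, 3)
    n = pos.shape[0]
    potential = np.zeros(n)
    for i in range(n):
        r = pos[i] - pos                      # Displacements to all atoms.
        r2 = np.einsum("ij,ij->i", r, r)      # Squared distances.
        r2[i] = np.inf                        # Exclude self-interaction.
        sr6 = (sigma**2 / r2) ** 3            # (sigma/r)^6
        potential[i] = np.sum(4.0 * epsilon * (sr6**2 - sr6))
    return potential

positions = np.random.rand(1000 * 3).astype(np.float32)
pe = lennard_jones_cpu(positions)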

Number of Atoms   Single CPU Time (s)   GPU Time (s)   Speedup
--------------------------------------------------------------
100               0.1490                0.0008         ~1830x
500               2.7130                0.0009         ~3100x
1000              10.8806               0.0009         ~11000x
2000              43.7833               0.0127         ~3400x
5000              274.5926              0.0135         ~20000x

One peculiarity is that the 2000-atom system shows a drop in speed-up; I'm not sure what causes this. It could be the block and grid settings, but then for 5000 atoms the speed-up factor jumps again.

Attention

The speed-up will depend on the GPU hardware, and I'm not comparing against multi-core CPU performance. The hardware used to generate the speed-up table was an RTX A4000 GPU, so other cards may do better or worse; it just depends. It's also important to note that this is a simplified example. Practical atomistic simulations often include additional routines and more complex force fields.

Summary

I think this benchmark demonstrates why GPUs can be so powerful for scientific computing tasks like MD simulations. As the number of atoms in the system increases, the GPU's ability to perform many calculations in parallel leads to increasingly dramatic performance improvements over a single CPU. Yes, MPI and OpenMP can also be used to provide great speed-ups on multi-core CPUs.

As I continue to explore the use of GPUs in materials science and AI, I'm pretty excited by the possibilities these performance improvements open up. With GPUs, we can simulate larger systems for longer timescales, potentially leading to new insights in materials research and drug discovery.

Footnotes


  1. In numerical methods, the problem at hand is represented in a discretized fashion that is solved using basic arithmetic operations. However, it is possible to represent the problem using symbolic formalism and then the symbolic form is manipulated based on rules and conditions to arrive at a solution. 

  2. I'm using scientific computing applications as a stand-in for both physical science and ML/AI computing. 

References

[1] D.B. Kirk, W.W. Hwu, Programming Massively Parallel Processors, Third Edition: A Hands-On Approach, CreateSpace Independent Publishing Platform, 2017. URL.

[2] J. Nickolls, W.J. Dally, The GPU Computing Era, IEEE Micro 30 (2010) 56–69. DOI.

[3] A. KlΓΆckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, A. Fasih, PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation, Parallel Computing 38 (2012) 157–174. DOI.


