
Tuesday, January 23, 2024

Shotgun Review of Fitting Metrics

In machine learning and statistical modeling, evaluating the quality of a model's predictions is fundamental. You want to know whether the neural network or polynomial function actually captured the observed data. Two commonly used metrics are Mean Squared Error (MSE) and Chi-squared ($\chi^2$). While both measure goodness of fit, they differ in their approaches and applications. In this post I'm just going to shotgun review the differences, emphasizing how $\chi^2$ incorporates the standard deviation of the observed data points.

Mean Squared Error (MSE): Basic staple in ML

MSE is widely used in machine learning and statistics for assessing model accuracy. It is defined as:

\begin{equation} \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y_i})^2 \label{eq:mse} \end{equation}

where $y_i$ are the observed values, $\hat{y_i}$ are the predicted values, and $N$ is the number of observations. MSE is favored for its simplicity and effectiveness, especially where the goal is accurate prediction across a diverse dataset. Keep in mind that if you're dealing with datasets in the tens of thousands of data points, a metric like MSE makes evaluation very straightforward.
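Since MSE shows up everywhere, here's a minimal sketch in Python (the toy arrays are hypothetical numbers, just for illustration):

import numpy as np

# Toy observed and predicted values (hypothetical)
y = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 19.0, 29.0])

# MSE: mean of the squared residuals
mse = np.mean((y - y_hat) ** 2)
print(mse)  # (4 + 1 + 1) / 3 = 2.0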

Chi-Squared ($\chi^2$): Fit quality using data-point uncertainties

The $\chi^2$ statistic is defined as:

\begin{equation} \chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - f(x_i)}{\sigma_i} \right)^2 \label{eq:chi} \end{equation}

Chi-squared is useful when the standard deviations or uncertainties $\sigma_i$ of the observed values are known and vary; here $f(x_i)$ is the model's prediction for observation $i$. In $\chi^2$, each term of the sum is weighted by the inverse of the variance of the corresponding observation, making it sensitive to model fit relative to data-point accuracy. One way to think about what $\chi^2$ tells us is to compare the numerator and denominator of each term: if the difference in the numerator is large and the uncertainty in the denominator is small, we get a large value, indicating poor predictive ability of the fit. If the difference is small compared to the uncertainty, the fit could be too aggressive (i.e., overfitted). Just to idle on this a bit more, let's put in some actual numbers:

\begin{align*} \chi^2 &= \left( \frac{10 - 12}{2} \right)^2 + \left( \frac{20 - 19}{3} \right)^2 + \left( \frac{30 - 29}{4} \right)^2 \\ &= 1 + 0.1111 + 0.0625 \\ &\approx 1.1736 \end{align*}

Upon inspection you can see how individual terms change based on the numerator and denominator values.
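To double-check the arithmetic, here's a quick sketch that reproduces the example above (the predictions and the $\sigma_i$ values of 2, 3, and 4 are the same assumed numbers):

import numpy as np

y = np.array([10.0, 20.0, 30.0])      # observed values
f_x = np.array([12.0, 19.0, 29.0])    # model predictions f(x_i)
sigma = np.array([2.0, 3.0, 4.0])     # per-point uncertainties

# Each residual is scaled by its uncertainty before squaring
chi2 = np.sum(((y - f_x) / sigma) ** 2)
print(chi2)  # ~1.1736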

Propagating Uncertainty in Model Parameters

Both MSE and $\chi^2$ provide an average perspective on the quality of fit. However, an important aspect of model evaluation is understanding the uncertainty in the fit parameters themselves, such as the weights in neural networks or the coefficients in polynomial functions. This is particularly meaningful when there is error/uncertainty on the observed values. To capture this we have to utilize uncertainty propagation, which is crucial because:

  • In models with higher parameter uncertainty, predictions might be less reliable, even if the overall fit quality (as measured by MSE or $\chi^2$) is good.
  • Techniques like error propagation, confidence intervals, and Bayesian methods can be used to quantify the uncertainty in model parameters; see the sketch after this list.
  • Understanding parameter uncertainty helps in assessing the model's predictive power and robustness, particularly in scenarios where predictions are used for critical decision-making.
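As a minimal sketch of the second bullet, SciPy's curve_fit returns the covariance matrix of the fit parameters when given per-point uncertainties, and the square roots of its diagonal are the one-sigma parameter errors. The linear model and data here are hypothetical:

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    # Hypothetical linear model for illustration
    return a * x + b

x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])
sigma = np.array([2.0, 3.0, 4.0])

# absolute_sigma=True treats sigma as absolute standard deviations
popt, pcov = curve_fit(model, x, y, sigma=sigma, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))  # one-sigma uncertainties on a and b
print(popt, perr)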

Summary

Both MSE and $\chi^2$ are valuable for assessing model prediction quality, but their applicability usually depends on the nature of the data and the modeling objectives. In ML, $\chi^2$ seems to be rarely used (I personally haven't seen it), whereas MSE is a standard loss function. The propagation of uncertainty in model parameters further adds to the complexity of model evaluation, but can be very important in assessing model reliability.




Tuesday, January 16, 2024

Materials Project REST API w/ Julia

The Materials Project (MP) is one of the most successful computational materials science databases for exploring materials properties [1]. On top of that, it's also large enough to do some serious machine learning for materials design and discovery. There are certainly limitations on the data generated^1, but I'm not going to touch on that. What I'll focus on is how to grab this data via the API. The reasons I'm creating this post are:

  1. To remind myself how to make HTTP requests to a REST API.
  2. To show how straightforward it is to do in Julia.

For most, the way to access the MP data is to use the Python MPRester package, which integrates with the very useful pymatgen package. However, I do a lot in Julia, and although it is very easy to install Python packages and use them from within Julia, it adds an additional dependency and overhead. So here I'll show how simple it is to use the REST API with only two Julia packages: HTTP and JSON. Both of these packages are implemented in pure Julia and are very mature within the Julia package ecosystem, so one does not need to worry about supported operations. You can install these packages in the REPL with:

using Pkg
Pkg.add("HTTP")
Pkg.add("JSON")

Background

I was not familiar with REST APIs until about two years ago. A REST API (Representational State Transfer Application Programming Interface) is a set of protocols and standards used for exchanging data between systems. It utilizes HTTP requests to access and manipulate data, which can come in a variety of formats but is typically JSON or XML. The most common use is to enable interactions between client and server in web applications, allowing for operations such as retrieving, updating, or deleting data stored on the server. It's a favorable architecture because it is scalable and performant, while also standardizing communication among different systems.

Since it uses the HTTP protocol, almost anything connected to the internet can use a REST API. Furthermore, any modern programming language with an HTTP library can, de facto, act as a REST client. Thus it's very easy to implement these calls in Julia.

Creating a function to grab an MP entry

We can now implement a function that grabs the summary data for a given MP entry ID. To use the MP REST API we need a base URL, an endpoint, and an operation. The base URL for the Materials Project REST API is just: https://api.materialsproject.org. What is an endpoint? An endpoint is just a particular location associated with a particular operation or set of operations that can be performed on a resource, such as retrieving, creating, updating, or deleting data. These endpoints are accessed through standard HTTP methods like GET, POST, PUT, and DELETE. Since we have no admin privileges, we can only use GET.
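Putting those pieces together, the full query URL for the material we'll look at below ends up as (the _all_fields query parameter is explained shortly):

https://api.materialsproject.org/materials/summary/mp-510604?_all_fields=true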

Okay, there is one last thing we need. Many APIs require authentication, meaning you need to be an approved user. To do so they typically use a unique digital key, which is nothing but a token consisting of a string of symbols^2. To get an MP API key, you need to sign up. We now have everything we need.

Note

For this function I'm just going to use the summary endpoint, which provides a fairly comprehensive dataset for a Materials Project ID. You can modify the endpoint to get more specific data (e.g., /materials/thermo/).

function get_mp_summary(id::String, api_key::String, all_fields=true)
    base_url = "https://api.materialsproject.org"
    # Query the summary endpoint for the given material ID
    endpoint = "materials/summary/$(id)?_all_fields=$(all_fields)"
    # Interpolate the URL directly; joinpath is meant for filesystem
    # paths and is not portable for building URLs
    query_url = "$(base_url)/$(endpoint)"
    headers = ["accept" => "application/json", "X-API-KEY" => api_key]
    response = HTTP.get(query_url, headers)
    # Parse the JSON response body into a Julia dictionary
    data = JSON.parse(String(response.body))
    return data
end

As you can see, it's a very small amount of code. The endpoint variable provides the specifics of our query: it points to a Materials Project ID and then uses ? to start the query string, which states whether all the data fields should be returned or not. In this case the query for all fields is yes (i.e., true). To create the full URL we just combine everything into query_url. The next piece is the headers variable, a vector of pairs specifying that we expect JSON-formatted data and providing the value of the API key. Finally, we make the HTTP request and then parse the returned JSON into a Julia dictionary. That's it!

The data

What we get is a fairly deep dictionary structure, so it's useful to go through at least the first two layers. I'll illustrate for mp-510604, which is Mn$_2$O$_3$.
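Assuming you've exported your API key as an environment variable (MP_API_KEY is just the name I'm using here), you can pull it into the Julia session with:

MP_API_KEY = ENV["MP_API_KEY"]  # read the API key from the shell environment

Now we can make the query and inspect the top-level keys: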

data = get_mp_summary("mp-510604", MP_API_KEY)
keys(data)

KeySet for a Dict{String, Any} with 1 entry. Keys:
  "data"

The data key holds a single entry, which is a Dict{String,Any}; this is where all the material structure and property data live. So now we want to go deeper into the data; let's list all the keys in that inner Dict{String,Any}:

keys(data["data"][1])

KeySet for a Dict{String, Any} with 70 entries. Keys:
  "e_ionic"
  "chemsys"
  "weighted_surface_energy_EV_PER_ANG2"
  "material_id"
  "homogeneous_poisson"
  "deprecated"
  "shape_factor"
  "uncorrected_energy_per_atom"
  ⋮

Once you understand the structure of the data, you can proceed however you intend to use the MP; that's it. As noted earlier, you can change the endpoint to look at other material properties calculated from the DFT data.
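For example, to pull a couple of specific fields out of the parsed dictionary (using keys from the listing above):

entry = data["data"][1]
println(entry["material_id"])   # "mp-510604"
println(entry["chemsys"])       # the chemical system, "Mn-O"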

Footnotes


  1. One of my pet peeves is that DFT calculations have become so revered that they are often used without sufficient caution or scrutiny. As numerical methods improve, with better exchange-correlation (XC) functionals and corrections such as DFT+U, the uncritical acceptance of large datasets built on standard DFT calculations, i.e., GGA or meta-GGA, seems questionable. My bias is that the quantum physics of materials is in truth a many-body problem, and not merely a ground-state one. Thus, if you're using large "approximate" datasets to train ML models, you're likely to encounter problems when moving away from in silico results! Feel free to correct me.

  2. My guess is the token is actually just a public or private ssh-key for the REST-API server that is assigned to you. 

References

[1] A. Jain, et al., Commentary: The Materials Project: A materials genome approach to accelerating materials innovation, APL Materials 1 (2013) 011002. https://doi.org/10.1063/1.4812323.



Tuesday, January 9, 2024

Dill Pickled Phonons

As you may have noticed, I've been using the ASE package to do some phonon calculations in my free time. One nice feature is the ASE Phonons class, which is a very useful programming interface for doing phonon calculations with different calculators and atomic structures. However, one limitation is that the Phonons object has no way to save its state; in other words, if you want to recalculate the band structure from the (real-space) dynamical matrix, you would need to write a function that saves it. Then, in order to use it and plot the band structure, say along a different q-point path than you originally used, you would need to reconstruct the Phonons object and all the other details. This is cumbersome, so why not make some Python pickles, specifically the dill variety.

The Python pickle package implements binary protocols for writing/reading a Python object's structure (i.e., serializing it). The process walks the object hierarchy to convert it to, and back from, a byte stream; this is pickling. So why dill pickling? Because some class methods and attributes are not easily pickled with the standard library, and the dill package handles these cases. What does this mean for the user? We can write the state of our Phonons object/calculation to a pickle/dill file and then use it later to re-plot the band structure.

Our First 🥒 Batch

To begin, we create an aluminum unit cell and then instantiate the Phonons object. I'm using the built-in EMT calculator for simplicity.

from ase.build import bulk
from ase.calculators.emt import EMT
from ase.phonons import Phonons

# Build an FCC aluminum unit cell
atoms = bulk('Al', 'fcc', a=4.05)

# Finite-displacement phonon calculation on a 7x7x7 supercell
N = 7
ph = Phonons(atoms, EMT(), supercell=(N, N, N), delta=0.05)
ph.run()
ph.read(acoustic=True)
ph.clean()

The band structure is obtained using the get_band_structure^1 method, which requires the q-point path. An example output is

BandStructure(path=BandPath(path='GXWKGLUWLK,UX', cell=[3x3], special_points={GKLUWX}, kpts=[50x3]), energies=[1x50x3 values], reference=0.0)

Now come our functions to perform the dill pickling:

import dill as pickle

def pickling_phonons(phonons_obj, filename="phonons.pkl"):
    # Serialize the Phonons object to a binary file
    with open(filename, 'wb') as f:
        pickle.dump(phonons_obj, f)
    return None

def unpickle_phonons(filename="phonons.pkl"):
    # Restore the Phonons object from the binary file
    with open(filename, 'rb') as f:
        dill_pickle_phonons = pickle.load(f)
    return dill_pickle_phonons

and we can easily perform pickling:

pickling_phonons(ph)

and then unpickle with

ph_pkl = unpickle_phonons()

which nicely gives:

BandStructure(path=BandPath(path='GXWKGLUWLK,UX', cell=[3x3], special_points={GKLUWX}, kpts=[50x3]), energies=[1x50x3 values], reference=0.0)

when we call the get_band_structure method with the same q-point path. So that's it, a very straightforward way to save your ASE Phonons object for later use. This will be particularly helpful for large supercells or for calculators that are costly to evaluate again. I've coded this into my fork of the ASE main branch and included this example as a test; however, I'm not sure it would make it into the next release, as it probably isn't robust enough against edge cases.
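For completeness, a minimal sketch of that call (atoms is the unit cell from the first code block, and the path construction follows footnote 1):

# Rebuild the band structure from the unpickled Phonons object
bs_restored = ph_pkl.get_band_structure(atoms.cell.bandpath())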

Note

One limitation is that pickling, even with dill, will not work for ASE calculator objects that are tied to external shared-library objects. For example, if you use the LAMMPSlib interface you will most likely get a failure, since this links to LAMMPS's shared C++ library objects and data types (I think this is how it works?).
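One possible workaround, untested and assuming the calculator lives on the object's calc attribute, is to detach the calculator before pickling and reattach a freshly constructed one after unpickling:

# Hypothetical workaround: drop the unpicklable calculator reference
ph.calc = None
pickling_phonons(ph)

# ...later, reattach a new calculator after unpickling
ph_pkl = unpickle_phonons()
ph_pkl.calc = EMT()  # or a fresh LAMMPSlib calculator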

Let me know if you have any suggestions on better approaches or a more comprehensive way to store the state of ASE Phonons, but in general I do enjoy making 🥒.

Footnotes


  1. The method call with the q-point path would be: ph.get_band_structure(atoms.cell.bandpath()), but the point is that with the pickled Phonons object you could unpickle it and then provide a different q-point path.




Thursday, January 4, 2024

Entropy: Thinking Beyond Disorder

Every so often I casually refer to terms like entropy or free energy and forget to really think about what those words fundamentally mean. So, with this post, I aim to focus on the meanings of these terms (i.e., without involving math), particularly entropy.

Entropy: More Than Just Disorder

Entropy is often described in terms of 'disorder' or 'randomness', but these terms, in my opinion, can be misleading and confusing. When we consider the epistemology of a system, that is, the study of what we can know about a system, entropy is meaningful only from a statistical perspective. This is because, scientifically, we understand that everyday matter^1 is made up of microscopic constituents (i.e., atoms or molecules), yet it's impossible^2 to know all the details about these constituents. Therefore, we rely on a statistical representation of their collective behavior. Entropy is fundamentally about the number of ways a system composed of matter (or information) can be arranged at the microscopic level while still appearing the same at the macroscopic level. It measures the diversity of microstates corresponding to a given macrostate.

As a result, I think it's best not to conceive of 'disorder' in the traditional sense, where objects are randomly spread out in a chaotic fashion, but rather as the numerous microstates a system can adopt while maintaining familiar macroscopic properties, such as temperature or pressure. Thus, entropy reflects a lack of specific information^3 about the individual microstates of a system, while still enabling knowledge of macroscopic properties.

The 'randomness' in entropy relates to the probabilistic nature of microstates. At the microscopic level, the exact configuration of particles in a system follows the laws of probability. This aspect of randomness is key to understanding why macroscopic properties of systems emerge as averages over many microstates. Simply put, we can never truly know which microstate the system is in, only the probability of a given microstate among all the other microstates.

Nature's Tendency To Maximize

What we empirically observe in nature is that matter at the microscopic and mesoscopic scales tends towards the largest possible configurational space of microscopic states. This tendency to maximize the number of accessible microstates drives the natural progression towards states of higher entropy and thermodynamic equilibrium. Recall that entropy relates the microstates to the macrostate, and thus nature tends to maximize the number of microstates corresponding to an equilibrium macrostate.

To recap: entropy, often encapsulated in terms like 'disorder' and 'randomness', is more usefully thought of as a nuanced measure of the probabilistic distribution of microstates in a system. Understanding entropy in this way helps one better appreciate the intricate relation between microscopic thermodynamics (i.e., statistical mechanics) and macroscopic thermodynamics.

Bonus: Free Energy

Free energy is another term worth digesting. It's best understood as the energy freely available in a system to do work, hence the name. This concept is intertwined with the energy content of a system and its capacity to perform work. Under the laws of thermodynamics, free energy reflects the principle that, while energy cannot be created or destroyed, it can be transferred or transformed, and only part of it is available to do useful work. Thus, free energy is a conceptual tool that helps us understand how energy moves within a system and its surroundings.

Footnotes


  1. Our best scientific theories, confirmed by experiment, indicate the fundamental constituents of matter are more elementary particles than atoms and molecules. However, for the most part, our interactions with the physical world at the smallest scales can be well described by thinking in terms of electrons, atoms, molecules, and the like.

  2. A famous thought experiment that attempts to circumvent the idea of not being able to determine what every constituent particle is doing is Maxwell's Demon. It was proposed as a way to violate the second law of thermodynamics for an adiabatic system, $\Delta S \ge 0$.

  3. In information theory, entropy is used to quantify information content. So rather than the microscopic states of matter, one thinks about how information can be represented. This was one of the seminal results put forth by Claude Shannon and was, I believe, inspired by John von Neumann.

