Graph-based representations of molecules have been an active area of research in recent years, with downstream applications ranging from property prediction (e.g., the HOMO-LUMO gap and various energy properties) to generative modeling of the molecules themselves.

We'll explore the generative modelling aspect. It's no secret that generative models (GenAI in the popular lexicon) have taken the world by storm in recent years. There has been remarkable progress in image, text, and video generation, and with some additional effort, generating more structured artifacts (such as code and mathematical proofs) has also seen amazing progress.

One (admittedly outdated by now) family of generative models is score models. This post aims to explain what they are and how they have been applied to the task of generating accurate 3D coordinate matrices for molecules. In the ML literature, this task is usually referred to as conformer prediction.

Score Models

Suppose we have a dataset $\{x_1, \dots, x_N\}$. Instead of training to maximize the log-likelihood of the data:

$$\max_{\theta} \sum_{i=1}^N \log p_{\theta}(x_i)$$

by modelling $p_{\theta}$ directly, score models try to learn a model $s_{\theta}$ that approximates $\nabla_x \log p_{\text{data}}(x)$. The training objective for score models amounts to minimizing the following:

$$\mathbb{E}_{p_{\text{data}}} \left[ \left\lVert s_{\theta}(x) - \nabla_x \log p_{\text{data}}(x)\right\rVert_2^2 \right]$$
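To build intuition for what a score is: for a standard normal, it has a closed form, $\nabla_x \log \mathcal{N}(x; 0, I) = -x$. A quick NumPy check against a finite-difference gradient (a toy illustration, not part of the paper):

```python
import numpy as np

# For a standard normal, log p(x) = -||x||^2 / 2 + const,
# so the score is simply grad_x log p(x) = -x.
def log_p(x):
    return -0.5 * np.sum(x ** 2)

def score(x):
    return -x

# sanity check: the analytic score matches a central-difference gradient
x = np.array([0.5, -1.2, 2.0])
eps = 1e-5
numeric = np.array([
    (log_p(x + eps * e) - log_p(x - eps * e)) / (2 * eps)
    for e in np.eye(len(x))
])
assert np.allclose(numeric, score(x), atol=1e-4)
```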

This objective can be shown to be equivalent to minimizing the following:

$$\mathbb{E}_{p_{\text{data}}} \left[ \operatorname{Tr}(\nabla_x s_{\theta}(x)) + \frac{1}{2}\left\lVert s_{\theta}(x)\right\rVert_2^2 \right]$$

The trace term is computationally prohibitive, and there have been several works that approximate or avoid it. One approach, called denoising score matching, perturbs a data point $x$ via a predefined noise distribution $q_{\sigma}(\tilde{x} \mid x)$ and estimates the score of the perturbed data distribution $q_{\sigma}(\tilde{x})$. The objective to be minimized is now the following:

$$\frac{1}{2}\mathbb{E}_{q_{\sigma}(\tilde{x} \mid x)} \mathbb{E}_{p_{\text{data}}} \left[ \left\lVert s_{\theta}(\tilde{x}) - \nabla_{\tilde{x}} \log q_{\sigma}(\tilde{x} \mid x) \right\rVert_2^2 \right]$$

In the paper Generative Modeling by Estimating Gradients of the Data Distribution, Song et al. set $q_{\sigma}(\tilde{x} \mid x)$ to be a Gaussian with mean $x$ and covariance $\sigma^2 I$. The score model takes in an additional parameter, $s_{\theta}(x, \sigma)$, hence the name Noise Conditional Score Network (NCSN).

Since $\nabla_{\tilde{x}} \log q_{\sigma}(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}$, the loss objective can be simplified to:

$$\frac{1}{2}\mathbb{E}_{q_{\sigma}(\tilde{x} \mid x)} \mathbb{E}_{p_{\text{data}}} \left[ \left\lVert s_{\theta}(\tilde{x}) + \frac{\tilde{x} - x}{\sigma^2} \right\rVert_2^2 \right]$$

In practice, we use Monte Carlo estimation to compute the expectations over $p_{\text{data}}$ and $q_{\sigma}$.
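In code, the denoising objective is just a few lines. Here is a NumPy sketch (a real implementation would use a neural network for `score_fn` and sample noise levels; here we plug in the analytic optimum for standard-normal data as a sanity check — `dsm_loss` is my own helper name):

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Monte Carlo estimate of the denoising score matching loss.

    Perturb each x with N(0, sigma^2 I); the regression target is the
    score of the perturbation kernel, -(x_tilde - x) / sigma^2.
    """
    x_tilde = x + sigma * rng.normal(size=x.shape)
    target = -(x_tilde - x) / sigma ** 2
    diff = score_fn(x_tilde) - target
    return 0.5 * np.mean(np.sum(diff ** 2, axis=-1))

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.normal(size=(4096, 2))  # "data": a standard normal

# For standard-normal data, the perturbed distribution is
# N(0, (1 + sigma^2) I), whose exact score is -x / (1 + sigma^2) --
# the minimizer of the loss (up to irreducible variance in the target).
optimal = lambda z: -z / (1 + sigma ** 2)
loss = dsm_loss(optimal, x, sigma, rng)
```

Note that even the optimal score does not drive the loss to zero: the target depends on the sampled noise, so a variance floor remains.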

Sampling from a Score Model

Now that we have a trained score model, how do we actually sample data? Langevin dynamics tells us that iterating the following update:

$$x_t = x_{t-1} + \frac{\epsilon}{2}s_{\theta}(x_{t-1}) + \sqrt{\epsilon}z_t$$

where $z_t \sim \mathcal{N}(0, I)$, ensures that the distribution of $x_t$ converges to the data distribution as $t$ goes to infinity (and $\epsilon$ goes to zero). In the NCSN framework, we follow an annealed version of this: for a decreasing sequence of noise levels $\sigma_1 > \dots > \sigma_L$, we run the update below at each level $i$ with step size $\alpha_i = \epsilon \cdot \sigma_i^2 / \sigma_L^2$:

$$x_t = x_{t-1} + \frac{\alpha_i}{2} s_{\theta}(x_{t-1}, \sigma_i) + \sqrt{\alpha_i}z_t$$
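As a sanity check of the (un-annealed) update, here is a toy NumPy sampler that plugs in the known score $-x$ of a standard normal; after enough steps the samples should look standard-normal. This is an illustration, not the molecular sampler:

```python
import numpy as np

def langevin(score_fn, x0, eps=0.01, n_steps=2000, rng=None):
    # Unadjusted Langevin dynamics:
    # x_t = x_{t-1} + (eps/2) * score(x_{t-1}) + sqrt(eps) * z_t
    if rng is None:
        rng = np.random.default_rng(0)
    x = x0
    for _ in range(n_steps):
        z = rng.normal(size=x.shape)
        x = x + 0.5 * eps * score_fn(x) + np.sqrt(eps) * z
    return x

# target: standard normal, whose score is -x; run many chains in parallel,
# all deliberately initialized far from the mode at x = 5
rng = np.random.default_rng(0)
samples = langevin(lambda x: -x, x0=5.0 * np.ones(10_000), rng=rng)
# samples should now have mean ~0 and variance ~1
```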

Applying Score Models to Molecular Conformation

Shi et al. applied NCSNs to molecular conformation in Learning Gradient Fields for Molecular Conformation Generation. Let's revisit the conformation generation setup. We can represent a molecule as a graph $G = (V, E)$, where each $v_i \in V$ is an atom with various node features (atomic number, charge), and each $e_{ij} \in E$ represents a bond between atoms $i$ and $j$. Each edge also has its own features, such as bond type and distance. Our task: given a graph $G$, predict a 3D coordinate matrix $R \in \mathbb{R}^{|V| \times 3}$.

Conceptually, what we want is to train a score model that estimates $\nabla_R \log p(R \mid G)$. This score model needs to be invariant to translation (shifting all coordinates by a constant should yield the same score) and equivariant to rotation (the gradient should rotate along with the coordinates).

In the paper, they make a pretty clever modelling choice. Instead of training a score model directly on a conformation $R$, they train a score model on the interatomic distances. This works because interatomic distances are invariant to rotation and translation, so standard graph neural networks can be used while implicitly enforcing the intrinsic geometry: generating accurate distances implicitly enforces the corresponding geometric constraints. The authors also extend the edge set $E$ with 2- and 3-hop neighbors to enforce even more geometric constraints, since 2-hop distances constrain bond angles and 3-hop distances constrain torsions.
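To make the multi-hop idea concrete, here's one way to find which atom pairs are exactly $h$ bonds apart from an adjacency matrix (a sketch; `k_hop_pairs` is my own helper name, not from the paper):

```python
import numpy as np

def k_hop_pairs(adj, max_hop=3):
    """Return {h: boolean matrix of pairs at graph distance exactly h}.

    adj: (n, n) 0/1 adjacency matrix of the bond graph.
    Uses powers of the adjacency matrix: a pair first gets a nonzero
    entry in adj**h at h equal to its shortest-path distance.
    """
    n = adj.shape[0]
    seen = np.eye(n, dtype=bool)   # pairs already assigned a smaller hop
    walks = np.eye(n, dtype=int)
    hops = {}
    for h in range(1, max_hop + 1):
        walks = walks @ adj                 # counts walks of length h
        hops[h] = (walks > 0) & ~seen       # keep only newly reached pairs
        seen |= hops[h]
    return hops

# example: a 4-atom chain 0-1-2-3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
hops = k_hop_pairs(adj)
# hops[2] marks (0,2) and (1,3); hops[3] marks only (0,3)
```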

But given a score model over the interatomic distances $d_{ij}$, how do we compute the score for a conformer? The key is to view the log-density as a composition $f_G \circ g$, where $g$ is the function that computes the interatomic distances $d = g(R)$ from the coordinates, and $f_G$ is a GNN that takes the molecule features and the distance vector.

Luckily, the distance score and the conformer score are related by the chain rule:

$$s_{\theta}(R)_i = \frac{\partial f_G(d)}{\partial r_i} = \sum_{e_{ij} \in E} \frac{\partial f_G(d)}{\partial d_{ij}} \cdot \frac{\partial d_{ij}}{\partial r_i} = \sum_{e_{ij} \in E} s_{\theta}(d)_{ij} \cdot \frac{1}{d_{ij}} \left( r_i - r_j \right)$$
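This conversion is simple to implement directly. A NumPy sketch (`coord_score` and the variable names are my own; in the toy check we set $s_\theta(d)_{ij} = d_{ij}$, i.e. $f_G(d) = \frac{1}{2}\sum d_{ij}^2$, so the coordinate score should reduce to $\sum_j (r_i - r_j)$):

```python
import numpy as np

def coord_score(R, edges, dist_scores):
    """Convert per-edge distance scores into per-atom coordinate scores.

    R: (n, 3) coordinates; edges: iterable of (i, j) index pairs;
    dist_scores: s_theta(d)_ij for each edge.
    Chain rule: edge (i, j) contributes s_theta(d)_ij * (r_i - r_j) / d_ij
    to atom i, and the negation to atom j.
    """
    s = np.zeros_like(R, dtype=float)
    for (i, j), sd in zip(edges, dist_scores):
        diff = R[i] - R[j]
        d = np.linalg.norm(diff)
        s[i] += sd * diff / d
        s[j] -= sd * diff / d
    return s

R = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
edges = [(0, 1), (0, 2), (1, 2)]
d = [np.linalg.norm(R[i] - R[j]) for i, j in edges]
s = coord_score(R, edges, d)   # here s[i] == sum_j (r_i - r_j)
```

Because the scores only depend on distances and coordinate differences, translating `R` leaves `s` unchanged, as required.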

During sampling, we operate on the distances and use the chain rule to convert our learned score over distances into a score over the conformer, $s_{\theta}(R)_i$. The sampling pseudocode is outlined below:

*(figure: annealed Langevin sampling pseudocode)*
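In runnable form, annealed Langevin sampling looks roughly like this. This is a generic NumPy sketch of the Song & Ermon procedure with illustrative hyperparameters, not the paper's exact code; the toy check uses the analytic score of Gaussian data rather than a trained network:

```python
import numpy as np

def annealed_langevin(score_fn, x0, sigmas, eps=0.05, steps=100, rng=None):
    """Annealed Langevin dynamics.

    score_fn(x, sigma): score model conditioned on noise level sigma.
    sigmas: decreasing noise levels sigma_1 > ... > sigma_L.
    The step size alpha_i = eps * sigma_i^2 / sigma_L^2 shrinks with the
    noise level, so later (low-noise) levels take finer steps.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    sigma_L = sigmas[-1]
    for sigma in sigmas:
        alpha = eps * (sigma / sigma_L) ** 2
        for _ in range(steps):
            z = rng.normal(size=x.shape)
            x = x + 0.5 * alpha * score_fn(x, sigma) + np.sqrt(alpha) * z
    return x

# toy check: for standard-normal data, the sigma-perturbed distribution
# is N(0, (1 + sigma^2) I), whose exact score is -x / (1 + sigma^2)
score_fn = lambda x, sigma: -x / (1 + sigma ** 2)
samples = annealed_langevin(score_fn, x0=np.zeros(5000),
                            sigmas=[1.0, 0.5, 0.2, 0.1])
```

For conformer generation, `x` would be the distance vector and `score_fn` would wrap the learned distance score plus the chain-rule conversion to coordinates.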

On the conformation prediction task, this method achieved SOTA results (for the time).

I was able to implement this in JAX; here is an example of a sample I generated.

Generated molecule for the SMILES string C[C@@]1(C#N)[C@@H](O)[C@H]1CO

Take a look at my implementation of this paper here.