Introduction

In the context of cryo-EM, many computationally exhaustive methods rely on simpler representations of cryo-EM density maps to overcome their scalability challenges. There are many choices for the form of the simpler representation, such as vectors (Han et al. 2021) or a mixture of Gaussians (Kawabata 2008). In this post, we discuss a format that is probably the simplest and uses a set of points (called a point cloud).

This problem can be formulated in a much more general sense rather than cryo-EM. In this sense, we are given a probability distribution over \(\mathbb{R}^3\) and we want to generate a set of 3D points that represent this distribution. The naive approach for finding such a point cloud is to just sample points from the distribution. Although this approach is guaranteed to find a good representation, it needs many points to cover the distribution evenly. Since methods used in this field can be computationally intensive with cubic or higher time complexity, generating a point cloud that covers the given distribution with a smaller point-cloud size leads to a significant improvement in their runtime.

In this approach, we present two methods for generating a point cloud from a cryo-EM density map or a distribution in general. The first one is based on the Topological Representing Network (TRN) (Martinetz and Schulten 1994) and the second one combines the usage of the Optimal Transport (OT) (Peyré, Cuturi, et al. 2019) theory and a computational geometry object named Centroidal Voronoi Tessellation (CVT).

Data

For the sake of simplicity in this post, we assume we are given a primal distribution over \(\mathbb{R}^2\). As an example, we will work on a multivariate Gaussian distribution that it’s domain is limited to \([0, 1]^2\). The following code prepares and illustrates the pdf of the example distribution.

import numpy as np
import scipy as scp
import matplotlib
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (20,20)



mean = np.array([0,0])
cov = np.array([[0.5, 0.25], [0.25, 0.5]])
distr = scp.stats.multivariate_normal(cov = cov, mean = mean, seed = 1)


fig, ax = plt.subplots(figsize=(8,8))
im = ax.imshow([[distr.pdf([i/100,j/100]) for i in range(100,-100,-1)] for j in range(-100,100)], extent=[-1, 1, -1, 1])
cbar = ax.figure.colorbar(im, ax=ax)
plt.title("The pdf of our primal distribution")
plt.show()

Both of the methods that we are going to cover are iterative methods relying on an initial sample of points. For generating a point cloud with size \(n\), they begin by randomly sampling \(n\) points and refining it over iterations. We use \(n=200\) in our examples.

def sampler(rvs):
    while True:
        sample = rvs(1)
        if abs(sample[0]) > 1 or abs(sample[1]) > 1:
            continue
        return sample

initial_samples = []
while len(initial_samples) < 200:
    sample = sampler(distr.rvs)
    initial_samples.append(list(sample))
initial_samples = np.array(initial_samples)

l = list(zip(*initial_samples))
x = list(l[0])
y = list(l[1])

fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(x, y)
ax.plot((-1,-1), (-1,1), 'k-')
ax.plot((-1,1), (-1,-1), 'k-')
ax.plot((1,1), (1,-1), 'k-')
ax.plot((-1,1), (1,1), 'k-')
plt.ylim(-1.1,1.1)
plt.xlim(-1.1,1.1)
plt.xticks([])
plt.yticks([])
plt.show()

Topology Representing Networks (TRN)

TRN is an iterative method that relies on randomly sampling an initial point cloud \(r_m(0)_{i=1,\dots,n}\) from the given probability distribution \(p\). At each step \(t\), they sample a new point (\(r_t\)) from \(p\) and compute the distance from points in \(r_m(t)\) to \(r_t\) and rank them from zero (closest) to \(n-1\) (called \(k_m\)). Then they update the position of points based on: \[r_m(t+1) = r_m(t) + \epsilon(t)exp(-k_m/\lambda(t))(r_t - r_m(t)),\] \[\epsilon(t) = \epsilon_0(\frac{\epsilon_f}{\epsilon_0})^{t/t_f},\] \[\lambda(t) = \lambda_0(\frac{\lambda_f}{\lambda_0})^{t/t_f}\] These equations are designed in a way that moves points slower as the number of iterations increases.

e0=0.5
ef=0.05
l0=1
lf=0.5
tf=2000

fig, axs = plt.subplots(2, 2, figsize=(9.5,9.5))

r = initial_samples
for t in range(tf):
    rt = sampler(distr.rvs)
    dist2 = ((rt - r)**2).sum(1) 
    order = dist2.argsort()
    rank = order.argsort().reshape(-1,1)
    l = l0*(lf/l0)**(t/tf)
    e = e0*(ef/e0)**(t/tf)
    r = r + e*np.exp(-rank/l)*(rt-r)
    
    if (t+1)%500 == 0:
        l = list(zip(*r))
        x = list(l[0])
        y = list(l[1])

        index = t//500
        axs[index//2][index%2].scatter(x, y, s=10)
        axs[index//2][index%2].title.set_text('Position of points after t=%d iterations'%(t+1,))
        axs[index//2][index%2].plot((-1,-1), (-1,1), 'k-')
        axs[index//2][index%2].plot((-1,1), (-1,-1), 'k-')
        axs[index//2][index%2].plot((1,1), (1,-1), 'k-')
        axs[index//2][index%2].plot((-1,1), (1,1), 'k-')
        plt.ylim(-1.1,1.1)
        plt.xlim(-1.1,1.1)
        axs[index//2][index%2].set_xticks([])
        axs[index//2][index%2].set_yticks([])
plt.show()

Centroidal Vornoi Tessellation (CVT)

Although TRN is intuitive it doesn’t minimize any specific objective function. Among the metrics that can be for determining the distance between a point cloud and a continuous distribution the semidiscrete Wasserstein distance (based on the Optimal Transport theory (Peyré, Cuturi, et al. 2019)) is of our interest. In other words, we want a point cloud that minimizes the semidiscrete Wasserstein distance to a given primal distribution. One can prove that such a point cloud forms a geometrical object named Centroidal Voronoi Tessellation (CVT) over the distribution. A CVT is a Voronoi diagram generated by a point cloud such that each point is centroid and generator of it’s Voronoi cell. More details about this object will be covered in future posts. Such a tessellation can be computed using Lloyd’s iterations by alternating between computing centroids and Voronoi cells. Unlike TRN this method generated a weighted point cloud.

def in_box(robots, bounding_box):
    return np.logical_and(np.logical_and(bounding_box[0] <= robots[:, 0],
                                         robots[:, 0] <= bounding_box[1]),
                          np.logical_and(bounding_box[2] <= robots[:, 1],
                                         robots[:, 1] <= bounding_box[3]))


def voronoi(robots, bounding_box):
    i = in_box(robots, bounding_box)
    points_center = robots[i, :]
    points_left = np.copy(points_center)
    points_left[:, 0] = bounding_box[0] - (points_left[:, 0] - bounding_box[0])
    points_right = np.copy(points_center)
    points_right[:, 0] = bounding_box[1] + (bounding_box[1] - points_right[:, 0])
    points_down = np.copy(points_center)
    points_down[:, 1] = bounding_box[2] - (points_down[:, 1] - bounding_box[2])
    points_up = np.copy(points_center)
    points_up[:, 1] = bounding_box[3] + (bounding_box[3] - points_up[:, 1])
    points = np.append(points_center,
                       np.append(np.append(points_left,
                                           points_right,
                                           axis=0),
                                 np.append(points_down,
                                           points_up,
                                           axis=0),
                                 axis=0),
                       axis=0)
    # Compute Voronoi
    vor = scp.spatial.Voronoi(points)
    # Filter regions and select corresponding points
    regions = []
    points_to_filter = [] # we'll need to gather points too
    ind = np.arange(points.shape[0])
    ind = np.expand_dims(ind,axis= 1)

    for i,region in enumerate(vor.regions): # enumerate the regions
        if not region: # nicer to skip the empty region altogether
            continue

        flag = True
        for index in region:
            if index == -1:
                flag = False
                break
            else:
                x = vor.vertices[index, 0]
                y = vor.vertices[index, 1]
                if not(bounding_box[0] - eps <= x and x <= bounding_box[1] + eps and
                       bounding_box[2] - eps <= y and y <= bounding_box[3] + eps):
                    flag = False
                    break
        if flag:
            regions.append(region)

            # find the point which lies inside
            points_to_filter.append(vor.points[vor.point_region == i][0,:])

    vor.filtered_points = np.array(points_to_filter)
    vor.filtered_regions = regions
    return vor

def centroid_region(vertices):
    A = 0
    C_x = 0
    C_y = 0
    for i in range(len(vertices)):
        p = distr.pdf(vertices[i])
        A += p
        C_x += vertices[i,0] * p
        C_y += vertices[i,1] * p
        
    C_x /= A
    C_y /= A
    return np.array([[C_x, C_y]]), A

def plot(r,ax):
    vor = voronoi(r, bounding_box)
    ax.set_xlim([-1.1, 1.1])
    ax.set_ylim([-1.1, 1.1])
    
    for region in vor.filtered_regions:
        vertices = vor.vertices[region + [region[0]], :]
        ax.plot(vertices[:, 0], vertices[:, 1], 'k-')
        
    centroids = []
    weights = []
    
    for region in vor.filtered_regions:
        vertices = vor.vertices[region + [region[0]], :]
        centroid, w = centroid_region(vertices)
        centroids.append(list(centroid[0, :]))
        weights.append(w)
        
    
    ax.scatter(vor.filtered_points[:, 0], vor.filtered_points[:, 1], s=5, c='b', alpha=weights/max(weights))
        
    ax.set_xticks([])
    ax.set_yticks([])
        
    centroids = np.asarray(centroids)
    return centroids, weights

import sys
bounding_box = np.array([-1., 1., -1., 1.]) 
eps = sys.float_info.epsilon
samples = initial_samples
fig, axs = plt.subplots(3,3,figsize=(9,9))
for i in range(9):
    axs[i//3][i%3].title.set_text('iteration t=%d'%(i + 1,))
    centroids, weights = plot(samples,axs[i//3][i%3])
    samples = np.copy(centroids)

plt.show()

More Examples

To further examine the effectiveness of these methods, we performed a simulation on a more complex distribution obtained by normalizing the intensities of a sketch of Naqsh-e Jahan Square and \(n=10^5\) points. This image shows the convergence of methods as well as the primal distribution.

Covering a more complex distribution with point clouds generated by TRN and CVT.

Application on Cryo-EM

Both of these methods are easily applicable to a 3D density map. A full implementation of both methods in ChimeraX (the standard visualization tool for cryo-EM) (Pettersen et al. 2021) is in this GitHub repo. TRN was first used in the field of cryo-EM by (Zhang et al. 2021), later on, we used it in our alignment methods Riahi, Zhang, et al. (2023). An example of its performance on cryo-EM density map EMDB:1717 is illustrated below. To the best of our knowledge, no paper has used CVT in the field of cryo-EM yet.

An example of covering EMDB:1717 with 200 points using TRN.

References

Han, Xusi, Genki Terashi, Charles Christoffer, Siyang Chen, and Daisuke Kihara. 2021. “VESPER: Global and Local Cryo-EM Map Alignment Using Local Density Vectors.” Nature Communications 12 (1): 2090.

Kawabata, Takeshi. 2008. “Multiple Subunit Fitting into a Low-Resolution Density Map of a Macromolecular Complex Using a Gaussian Mixture Model.” Biophysical Journal 95 (10): 4643–58.

Martinetz, Thomas, and Klaus Schulten. 1994. “Topology Representing Networks.” Neural Networks 7 (3): 507–22.

Pettersen, Eric F, Thomas D Goddard, Conrad C Huang, Elaine C Meng, Gregory S Couch, Tristan I Croll, John H Morris, and Thomas E Ferrin. 2021. “UCSF ChimeraX: Structure Visualization for Researchers, Educators, and Developers.” Protein Science 30 (1): 70–82.

Peyré, Gabriel, Marco Cuturi, et al. 2019. “Computational Optimal Transport: With Applications to Data Science.” Foundations and Trends in Machine Learning 11 (5-6): 355–607.

Riahi, Aryan Tajmir, Geoffrey Woollard, Frédéric Poitevin, Anne Condon, and Khanh Dao Duc. 2023. “Alignot: An Optimal Transport Based Algorithm for Fast 3d Alignment with Applications to Cryogenic Electron Microscopy Density Maps.” IEEE/ACM Transactions on Computational Biology and Bioinformatics.

Riahi, Aryan Tajmir, Chenwei Zhang, James Chen, Anne Condon, and Khanh Dao Duc. 2023. “Empot: Partial Alignment of Density Maps and Rigid Body Fitting Using Unbalanced Gromov-Wasserstein Divergence.” arXiv Preprint arXiv:2311.00850.

Zhang, Yan, James Krieger, Karolina Mikulska-Ruminska, Burak Kaynak, Carlos Oscar S Sorzano, José-Marı́a Carazo, Jianhua Xing, and Ivet Bahar. 2021. “State-Dependent Sequential Allostery Exhibited by Chaperonin TRiC/CCT Revealed by Network Analysis of Cryo-EM Maps.” Progress in Biophysics and Molecular Biology 160: 104–20.