Challenges in Integrated Modelling

The conundrum lies in the computational cost incurred in modelling three-dimensional structures for protein complexes.

The challenges stem from sampling the entire space of valid protein conformations for instances that satisfy the experimental evidence.
Under-sampling of the conformational space risks missing one or more valid solutions. This may have far-reaching consequences.
A conservation approach is therefore adopted by beginning with a sizable number, say $k$, of initial orientations for the protein sub-units forming the complexes. The guess must be randomly selected across the space of valid protein conformations. This is easier said than accomplished. Indeed, a large volume of work exists to sample this space.

Each of these is successively improved to decrease the system’s potential energy. The goal is that the conformation with the lowest energy is a thermodynamically stable state. However, the underlying computations involve deriving a new position for each of the atoms involved, as they are perturbed in each iteration. Recall that even for a moderately sized protein, the number of atoms can be in tens of thousands. The computational overhead is further compounded by the fact that:

While parallel computations are employed to mitigate the overhead, the increased demands on the computational resources are substantially large.
Each run originating from a single guess results in an ensemble of protein structures.
Branch and bound and/or clustering methods are designed to quell the exponential increment of the solutions space.

The factors combine together to make the protein modelling problem beyond the reach of modest user with limited resources. This includes handheld devices or edge computing.

Motivation

Why random guesses when data says otherwise. We intend to find initial orientations such that the experimental evidence is satisfied.
How sensitive are the relative orientations of the proteins, given errors in the experimentally derived crosslinks?
How sensitive are the relative orientations of the protein molecules for tailor-made contact points?
If multiple orientations are possible? Can we predict new interfaces if they can be obtained with limited computation cost? This would be useful in artificial drug design.

Method

Conceptualize (Cycle-1)

Puzzle to align pieces of wood such that thin struts between them fall into specific notches.

Let’s solve a puzzle. Blocks of wood must be placed adjacent but not clash with each other. To aid the mutual orientations, thin pegs of wood must snugly fit in their respective notches in the individual pieces. In essence, the puzzle involves,

Bring two bodies together such that they do not deform.
Orientate them such that separation between certain anchor points remains fixed.
Perform minor adjustments so that any errors are ironed out.

Where else are similar problems found?

Multiview registration
DREAM
Sensor network localization
Robot position determination

We now revisit the problem of protein complex modelling from the experimental data. The inputs include:

Three-dimensional coordinate files for the subunits of the protein complex.
Experimental evidence, i.e., the crosslink file indicating the residues belonging to different subunits that are proximal to each other. The cutoff distance depends on the experimental setup.

We now formalize the mathematical construct for the problem so that we can analytically investigate the various steps involved.

Mathematical modelling

Distance preserving transformation

One of the first steps is to characterize the set of transformations that the individual subunits have to undergo. Note that they should not suffer any deformation. In other words, the subunits must behave as rigid bodies. Any transformation that the rigid bodies undergo must preserve their internal structure, i.e., inter-atomic distance between points belonging to a single subunit must remain fixed.

Problems similar to complex modelling. Multiview registration attempts to stitch images together (B) using different camera poses (A). LIDAR data is used in autonomous vehicles to simultaneously estimate their position with respect to its surroundings as shown in (C). DREAM uses the bottom-up strategy of modelling sub-structures before registering them into a single structure.

It can be shown mathematically that only rigid body motion, composed of rotation/reflection and translation quality for distance preserving transformation. The space of the rigid body orientations is well known to form a special set in mathematics known as the in $3$D, denoted by $\mathbb{SE}(3)$. The set belongs to a larger class known as differential manifold. This is a generalization of euclidean space in higher dimensions. Differentiation operation may be defined on such a set, and consequently, one may model mathematical algorithms to directly optimize on $\mathbb{SE}(3)$. In other words, the problem of deriving optimal rigid motion, can be directly defined on $\mathbb{SE}(3)$. Indeed, a large volume of literature exists in this regard.

For our purpose of deriving a $3$D structure for the complex, each of the subunits is considered a rigid body. The crosslinks form the upper bounds in distance, which need to be satisfied when the subunits are aligned together.

$\mathbb{SE}(3)$ is composed of the Cartesian product of two sets, one being the set of rotations in $3$D ($\mathbb{SO}(3)$) and seconds being the space of $3$D coordinates itself ($\mathbb{R}^3$). Formally, $$\mathbb{SE}(3) = \mathbb{SO}(3) \ltimes \mathbb{R}^3$$We use this concept to break our problem of complex modelling into two steps. In essence, the computation of optimal mutual orientation between the subunits (each orientation belonging to $\mathbb{SE}(3)$) is broken down into two steps. First, an optimization module is designed over the Cartesian coordinate $\mathbb{R}^3$. The second is over the rotation group $\mathbb{SO}(3)$.

Implementation (Cycle-2)

The underlying idea is to model a hypothetical structure consisting of only the C$^\alpha$ atoms from various subunits involved in the crosslinks. Synonymous with the wooden puzzle in Fig. 2.1, the support structure forms a scaffold for the subunits to attach with.

Localization

Let $C_1$ and $C_2$ be the C$^\alpha$ atoms from subunit-$1$ and $2$ involved in the crosslinks. Since the structures of subunits are known, we derive a set of equality bounds between any pair of atoms belonging to $C_1$ or $C_2$. Let the equality bound be denoted by $\mathcal{E}_1$ and $\mathcal{E}_2$. Let the pairs of atoms forming crosslinks be denoted by $\mathcal{L}_{12}$ and the magnitude be $c$. We want to compute a $3$D coordinates $\left\{x_i \right\}_{i \in C_1 \cup \, C_2}$, such that the distance constraints are satisfied.

Mathematically, \[\label{localizationprob} \begin{aligned} \mathrm{Find} &\left\{x_i \right\}_{i \in C_1 \cup \, C_2}\\ \text{subject to} &\quad \|x_i - x_j \|_2^2 = d^2_{ij} \quad & (i,j) \in \mathcal{E}_1 \cup \mathcal{E}_2\\ &\quad \|x_i - x_j \|_2^2 \leqslant \underline{c}^2 \pm \sigma \quad & (i,j) \in \mathcal{L}_{12} \end{aligned}\]

Here, $\sigma$ is randomly drawn from a normal distribution $\mathcal{N}(0,\mu)$ to denote the uncertainty of assigning a particular value to the magtitude of the crosslink. The mathematical therefore models the noise in the experimental data. The problem can be shown to be reduceable to a more general problem of Graph Embedding in $3$D. This was shown to be NP-hard, i.e., computationaly intractable in polynomial time . Hence we look for indirect approaches to estimate the solution of problem.

Consider a Euclidean distance matrix $D$. Each cell in the matrix is the distance between a pair of atoms. Refer to fig provided below. The matrix can be divided into two blocks. The lilac and green block contain the equality bounds between the C$^\alpha$ atoms belonging to $C_1$ and $C_2$. The white blocks are empty except for the pairs, for which the upper bound can be found in the list of crosslinks $\mathcal{L}_{12}$. The matrix there incomplete.

Techniques such as review in (6th Citation), establish approaches to approximate a complete distance matrix, say $\tilde{D}$, which best approximates the entries defined in $D$. We found it prone to errors for sparse data, i.e., for a smaller number of C$^\alpha$ atoms, whereas Sci-py performed significantly better. Interestingly, the trend was found to be the opposite if more atoms were involved. Since the number of crosslinks was in the order of $10$-$12$, Sci-py was found to work best empirically.

We can then derive the coordinates $X = \left\{x_i \right\}_{i \in C_1 \cup \, C_2}$ from $\tilde{D}$ by two steps:

We derive $\tilde{X}$ by Cholesky’s decomposition of $\tilde{D}$, i.e., $$\tilde{D} = \tilde{X}^\top \tilde{X}$$
Note that if $\tilde{D}$, were a complete matrix, we would have obtained $\tilde{X}$ as $3$D coordinates. However, since it is incomplete there are no such guarantees. In fact, for lower number of crosslinks, or ambiguities in them, $\tilde{X}$ belongs to a higher dimension.
We project $\tilde{X}$ back to $3$D space by solving the problem \[\begin{aligned} \underset{X}{\mathrm{min}} \quad&\|X - \tilde{X}\|_F^2 \\ \text{subject to} \quad& \mathrm{rank}(X) = 3. \end{aligned}\] This is the well-known Young-Egkart theorem and has a closed-formed solution by the singular value decomposition of $\tilde{D}$.

Euclidean distance matrix (EDM). Each cell in the matrix denotes a pairwise distance making EDM symmetric. While the coloured blocks are filled by equality bounds $\mathcal{E}_1, \mathcal{E}_2$, the white block is largely empty. Only crosslinks entries $\mathcal{L}_{12}$ contribute to it. Localization problem can be modelled as completing such an incomplete EDM. $\sigma$ is a noise drawn from normal distribition $\mathcal{N}(0,\mu)$

Registration

Once the supporting structure(s) has been computed, we compute the optimal orientation of the subunits, such that they attach to specified points in the supporting structure. The specific points to which the subunits attaches themselves, if of course defined by the C$^\alpha$ atoms which constitute the supporting structure. However, the question remains, how does one find an optimal rigid body motion (rotations, translations) which would align all the bodies together.

Pairwise versus global

Consequences of Pairwise alignment , the image shows 4 sub units being resgistered (aligned) together by a sequence of pairwise alignment. Difference possibilities exist based on the sequence choosen, not all of which are valid protein structures. Thus one has to iterate through all the possibilities for the best result adding computational overhead.

A go to solution could be a sequence of pairwise alignments. A sequence of pairwise alignment leads to the accumulation of errors. Further, the merging sequence is a combinatorial problem since all may not lead to the same solution. Morevoer, the set of all rigid body motions, $SE(3)$, is a non-convex set. Hence optimization algorithms cannot be applied direct on such sets.

Let $y_i$ be the unknown coordinates of the PDB’s oriented together. Let $x_i$ be coordinates of the individual PDB’s and the hypothetical structure computed in the localization stage. Each of the PDB’s and the hypothetical structure under a rigid body motion (Rotation-$R_i$ and translation-$t_i$) such the C$^\alpha$ atoms align together. We find a solution to the following problem: \[\label{regeq} \underset{y_i, R_k, t_k} {\text{min}} \quad \sum_{k=1}^m \quad \sum_{i \in \mathcal{C}_k} \|y_i - \left(R_k x^k_i + t_k\right)\|_2^2.\]

In general, directly computing solution over a set of rotation matrices is computationaly intractable since it is a non-smooth non-convex set. However, describes a global alignment approach following a non-linear optimization formulation known as Semidefinite programmimng , to estimate a solution to the following problem. The variables of equation are separated into $\mathbb{R}^3$ and $\mathbb{O}(3)$. In order to circumvent the non-convexity of $\mathbb{O}(3)$, a convex progam is designed over a new variable $G$ defined as $G = R^\top R$, i.e, inner product of rotation matrices. This has been employed in as well to derive protein structure from NMR. We apply similar approach to align the coordinates of the subunits and the supporting structure together.

Ramachandran flip (PDB ID 1DFJ). Working with distance data results in mirrored structures. Although mathematically valid, they are biologically invalid. Chain-I in (a) with its Ramachandran map shown in (d) is such a case. We reflect the same to result in (c). Ramachandran map is shown in (e). Chain-E in (a) or (c) require no such flip as seen from the plot in (b).

Since, we are working with distance data alone, it is impossible to distinguish mirror images from each other. Consider the registration problem. Consider two $3$D points $y$ and $x$ such that they are related to each other by, \[\label{2bodyreg} y = R x\]

such that $R \in \mathbb{SO}(3)$. Define $I_{-1}$ as the identity matrix with a single diagonal entry replaced by $-1$. Note that $I_{-1} R$ is also a solution of equation.

In context of registration problem, the may lead to the chain I, as shown in (a) in the figure above. The corresponding Ramachandran map for the chain I is shown in (d) in the figure above. Hence chain E is reflected as correction. This can be done by noting the rotation matrix, say $R_{I}$ which was obtained from the registration step, and multiplying any diagonal value by $-1$. Thus determinant of $R_{I}$ will change sign, indicating a reflection. The coordinates of chain I, acted upon the by the corrected $R_{E}$ results shown in (c) in the figure above. The corresponding Ramachandran map shown in (e) in the figure above. Note that no such corrections was required for chain E.

Iterating (Cycle-4)

Simplified Flowchart for IMPROViSeD. The subunits are oriented in multiple iterations. Each resulting from a subset of crosslinks, followed by localization and registration. The iterations proceed in parallel. The structures without clashes while satisfying the crosslinks gives final results.

Protein is a dynamic molecule. Their conformations undergo changes due to even minor flexibility of the side chains or the main chain. In order to model such a scenario, we generate multiple conformations as described in and summarized in Fig. 2.5. The number of crosslinks are varied in each run. Their number were taken to vary between $6$-$8$ for each case. The localization step may lead to multiple solutions. In order to explore the space of valid conformational spaces, the localization step was computed multiple times varying the seed value. The structures are then registered. Note that each iteration is computationaly fast. Morevers, the executations can progress parallely. This further reduces the computation time.

In order to prune the exponential increase of solution space, the structure are evaluated for crosslink violations and clashes. The ones falling below a pre-determined thresshold are discarded. Also the whole process terminates beyond a user defined number of iterations. The final structures obtaiend are ranked on the basis of crosslink violation fraction and the energy score, using PyRosetta, computed for each of them.

Post-processing (Cycle-4)

Pseudocode for IMPROViSeD. Modelling three dimensional structures for protein complexes using crosslinking is broken down into two steps: Localization and Registration. The subunits are oriented in multiple iterations. Each results from a subset of crosslinks, followed by localiztion and registration. The iterations proceed in parallel. The structures with low crosslink violations and those that do not have clashes make it to as final result.

Psuedocode for IMPROViSeD. The subunits are oriented in multiple iterations. Each resulting from a subset of crosslinks, followed by localization and registration. The iterations proceed in parallel. The structures without clashes while satisfying the crosslinks gives final results.

We perform only a short EM (Energy minimization) step for a crude orientation. The user may choose to employ detailed steps such as

Side chain EM
Main chain EM
Simulated annealing

References

Daniel Russel, Keren Lasker, Ben Webb, Javier Vel´azquez-Muriel, Elina Tjioe, Dina Schneidman-Duhovny, Bret Peterson, and Andrej Sali. Putting the pieces together: in- tegrative modelling platform software for structure determination of macromolecular as- semblies. PLoS Biol., 10(1):e1001244, January 2012.
Cyril Dominguez, Rolf Boelens, and Alexandre MJJ Bonvin. Haddock: a protein- pro- tein docking approach based on biochemical or biophysical information. Journal of the American Chemical Society, 125(7):1731–1737, 2003.
Serge Lang. Differential manifolds, volume 2. Springer, 1972.
Nathan Jacobson. Lie algebras. Courier Corporation, 2013.
Arnaud de Mesmay, Yo’av Rieck, Eric Sedgwick, and Martin Tancer. embeddability in R3 is NP-hard. SODA ’18, pages 1316–1329, 2018
Leo Liberti, Carlile Lavor, Nelson Maculan, and Antonio Mucherino. Euclidean distance geometry and applications. SIAM Review, 56(1):3–69, 2014.
Bamdev Mishra, Gilles Meyer, and Rodolphe Sepulchre. Low-rank optimization for distance matrix completion. In 2011 50th IEEE Conference on Decision and Control and European Control Conference, pages 4455–4460. IEEE, 2011.
Carl D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2001.
K N Chaudhury, Y Khoo, and A Singer. Global registration of multiple point clouds using semidefinite programming. SIAM Journal on Optimization, 25(1):468–501, 2015.
Niladri R. Das, Kunal N. Chaudhury, and Debnath Pal. Improved nmr-data-compliant protein structure modelling captures context-dependent variations and expands the scope of functional inference. Proteins: Structure, Function, and Bioinformatics, 91(3):412–435, 2023.
Niladri Ranajan Das, Kunal Narayan Chaudhury, and Debnath Pal. DREAMweb: An online tool for graph-based modelling of NMR protein structure. Proteomics, page e2300379, April 2024.
Zhang Yichi, Jindal Muskaan, Viswanath Shruthi, and Sitharam Meera. Integrative docking benchmark, 2020

Engineering