The conundrum lies in the computational cost incurred in modelling three-dimensional structures for protein complexes.
Each of these is successively improved to decrease the system’s potential energy. The goal is that the conformation with the lowest energy is a thermodynamically stable state. However, the underlying computations involve deriving a new position for each of the atoms involved, as they are perturbed in each iteration. Recall that even for a moderately sized protein, the number of atoms can be in tens of thousands. The computational overhead is further compounded by the fact that:
The factors combine together to make the protein modelling problem beyond the reach of modest user with limited resources. This includes handheld devices or edge computing.
Let’s solve a puzzle. Blocks of wood must be placed adjacent but not clash with each other. To aid the mutual orientations, thin pegs of wood must snugly fit in their respective notches in the individual pieces. In essence, the puzzle involves,
Where else are similar problems found?
We now revisit the problem of protein complex modelling from the experimental data. The inputs include:
We now formalize the mathematical construct for the problem so that we can analytically investigate the various steps involved.
Distance preserving transformation
One of the first steps is to characterize the set of transformations that the individual subunits have to undergo. Note that they should not suffer any deformation. In other words, the subunits must behave as rigid bodies. Any transformation that the rigid bodies undergo must preserve their internal structure, i.e., inter-atomic distance between points belonging to a single subunit must remain fixed.
It can be shown mathematically that only rigid body motion, composed of rotation/reflection and translation quality for distance preserving transformation. The space of the rigid body orientations is well known to form a special set in mathematics known as the in \(3\)D, denoted by \(\mathbb{SE}(3)\). The set belongs to a larger class known as differential manifold. This is a generalization of euclidean space in higher dimensions. Differentiation operation may be defined on such a set, and consequently, one may model mathematical algorithms to directly optimize on \(\mathbb{SE}(3)\). In other words, the problem of deriving optimal rigid motion, can be directly defined on \(\mathbb{SE}(3)\). Indeed, a large volume of literature exists in this regard.
For our purpose of deriving a \(3\)D structure for the complex, each of the subunits is considered a rigid body. The crosslinks form the upper bounds in distance, which need to be satisfied when the subunits are aligned together.
\(\mathbb{SE}(3)\) is composed of the Cartesian product of two sets, one being the set of rotations in \(3\)D (\(\mathbb{SO}(3)\)) and seconds being the space of \(3\)D coordinates itself (\(\mathbb{R}^3\)). Formally, $$\mathbb{SE}(3) = \mathbb{SO}(3) \ltimes \mathbb{R}^3$$We use this concept to break our problem of complex modelling into two steps. In essence, the computation of optimal mutual orientation between the subunits (each orientation belonging to \(\mathbb{SE}(3)\)) is broken down into two steps. First, an optimization module is designed over the Cartesian coordinate \(\mathbb{R}^3\). The second is over the rotation group \(\mathbb{SO}(3)\).
The underlying idea is to model a hypothetical structure consisting of only the C\(^\alpha\) atoms from various subunits involved in the crosslinks. Synonymous with the wooden puzzle in Fig. 2.1, the support structure forms a scaffold for the subunits to attach with.
Let \(C_1\) and \(C_2\) be the C\(^\alpha\) atoms from subunit-\(1\) and \(2\) involved in the crosslinks. Since the structures of subunits are known, we derive a set of equality bounds between any pair of atoms belonging to \(C_1\) or \(C_2\). Let the equality bound be denoted by \(\mathcal{E}_1\) and \(\mathcal{E}_2\). Let the pairs of atoms forming crosslinks be denoted by \(\mathcal{L}_{12}\) and the magnitude be \(c\). We want to compute a \(3\)D coordinates \(\left\{x_i \right\}_{i \in C_1 \cup \, C_2}\), such that the distance constraints are satisfied.
Mathematically, \[\label{localizationprob} \begin{aligned} \mathrm{Find} &\left\{x_i \right\}_{i \in C_1 \cup \, C_2}\\ \text{subject to} &\quad \|x_i - x_j \|_2^2 = d^2_{ij} \quad & (i,j) \in \mathcal{E}_1 \cup \mathcal{E}_2\\ &\quad \|x_i - x_j \|_2^2 \leqslant \underline{c}^2 \pm \sigma \quad & (i,j) \in \mathcal{L}_{12} \end{aligned}\]
Here, \(\sigma\) is randomly drawn from a normal distribution \(\mathcal{N}(0,\mu)\) to denote the uncertainty of assigning a particular value to the magtitude of the crosslink. The mathematical therefore models the noise in the experimental data. The problem can be shown to be reduceable to a more general problem of Graph Embedding in \(3\)D. This was shown to be NP-hard, i.e., computationaly intractable in polynomial time . Hence we look for indirect approaches to estimate the solution of problem.
Consider a Euclidean distance matrix \(D\). Each cell in the matrix is the distance between a pair of atoms. Refer to fig provided below. The matrix can be divided into two blocks. The lilac and green block contain the equality bounds between the C\(^\alpha\) atoms belonging to \(C_1\) and \(C_2\). The white blocks are empty except for the pairs, for which the upper bound can be found in the list of crosslinks \(\mathcal{L}_{12}\). The matrix there incomplete.
Techniques such as review in (6th Citation), establish approaches to approximate a complete distance matrix, say \(\tilde{D}\), which best approximates the entries defined in \(D\). We found it prone to errors for sparse data, i.e., for a smaller number of C\(^\alpha\) atoms, whereas Sci-py performed significantly better. Interestingly, the trend was found to be the opposite if more atoms were involved. Since the number of crosslinks was in the order of \(10\)-\(12\), Sci-py was found to work best empirically.
We can then derive the coordinates \(X = \left\{x_i \right\}_{i \in C_1 \cup \, C_2}\) from \(\tilde{D}\) by two steps:
Once the supporting structure(s) has been computed, we compute the optimal orientation of the subunits, such that they attach to specified points in the supporting structure. The specific points to which the subunits attaches themselves, if of course defined by the C\(^\alpha\) atoms which constitute the supporting structure. However, the question remains, how does one find an optimal rigid body motion (rotations, translations) which would align all the bodies together.
A go to solution could be a sequence of pairwise alignments. A sequence of pairwise alignment leads to the accumulation of errors. Further, the merging sequence is a combinatorial problem since all may not lead to the same solution. Morevoer, the set of all rigid body motions, \(SE(3)\), is a non-convex set. Hence optimization algorithms cannot be applied direct on such sets.
Let \(y_i\) be the unknown coordinates of the PDB’s oriented together. Let \(x_i\) be coordinates of the individual PDB’s and the hypothetical structure computed in the localization stage. Each of the PDB’s and the hypothetical structure under a rigid body motion (Rotation-\(R_i\) and translation-\(t_i\)) such the C\(^\alpha\) atoms align together. We find a solution to the following problem: \[\label{regeq} \underset{y_i, R_k, t_k} {\text{min}} \quad \sum_{k=1}^m \quad \sum_{i \in \mathcal{C}_k} \|y_i - \left(R_k x^k_i + t_k\right)\|_2^2.\]
In general, directly computing solution over a set of rotation matrices is computationaly intractable since it is a non-smooth non-convex set. However, describes a global alignment approach following a non-linear optimization formulation known as Semidefinite programmimng , to estimate a solution to the following problem. The variables of equation are separated into \(\mathbb{R}^3\) and \(\mathbb{O}(3)\). In order to circumvent the non-convexity of \(\mathbb{O}(3)\), a convex progam is designed over a new variable \(G\) defined as \(G = R^\top R\), i.e, inner product of rotation matrices. This has been employed in as well to derive protein structure from NMR. We apply similar approach to align the coordinates of the subunits and the supporting structure together.
Since, we are working with distance data alone, it is impossible to distinguish mirror images from each other. Consider the registration problem. Consider two \(3\)D points \(y\) and \(x\) such that they are related to each other by, \[\label{2bodyreg} y = R x\]
such that \(R \in \mathbb{SO}(3)\). Define \(I_{-1}\) as the identity matrix with a single diagonal entry replaced by \(-1\). Note that \(I_{-1} R\) is also a solution of equation.
In context of registration problem, the may lead to the chain I, as shown in (a) in the figure above. The corresponding Ramachandran map for the chain I is shown in (d) in the figure above. Hence chain E is reflected as correction. This can be done by noting the rotation matrix, say \(R_{I}\) which was obtained from the registration step, and multiplying any diagonal value by \(-1\). Thus determinant of \(R_{I}\) will change sign, indicating a reflection. The coordinates of chain I, acted upon the by the corrected \(R_{E}\) results shown in (c) in the figure above. The corresponding Ramachandran map shown in (e) in the figure above. Note that no such corrections was required for chain E.
Protein is a dynamic molecule. Their conformations undergo changes due to even minor flexibility of the side chains or the main chain. In order to model such a scenario, we generate multiple conformations as described in and summarized in Fig. 2.5. The number of crosslinks are varied in each run. Their number were taken to vary between \(6\)-\(8\) for each case. The localization step may lead to multiple solutions. In order to explore the space of valid conformational spaces, the localization step was computed multiple times varying the seed value. The structures are then registered. Note that each iteration is computationaly fast. Morevers, the executations can progress parallely. This further reduces the computation time.
In order to prune the exponential increase of solution space, the structure are evaluated for crosslink violations and clashes. The ones falling below a pre-determined thresshold are discarded. Also the whole process terminates beyond a user defined number of iterations. The final structures obtaiend are ranked on the basis of crosslink violation fraction and the energy score, using PyRosetta, computed for each of them.
Pseudocode for IMPROViSeD. Modelling three dimensional structures for protein complexes using crosslinking is broken down into two steps: Localization and Registration. The subunits are oriented in multiple iterations. Each results from a subset of crosslinks, followed by localiztion and registration. The iterations proceed in parallel. The structures with low crosslink violations and those that do not have clashes make it to as final result.
We perform only a short EM (Energy minimization) step for a crude orientation. The user may choose to employ detailed steps such as