We demonstrate the results of our pipeline in two cases: one using experimental crosslinking data and the other using synthetic crosslinking data. The crosslinking distance depends on the crosslinker used; in the cases discussed below, it is taken to be 30 Angstroms. In the first case, each of the two protein structures in the complex consists of a single chain. In the second case, a dimer and a monomer form the complex. The data for the second case were synthetically generated, as detailed in the "LCN2-MMP9 Case" section.

Ribonuclease Inhibitor Complexed With Ribonuclease A (PDB 1DFJ)

We are working with the protein identified by PDB ID 1DFJ, which consists of two chains, as illustrated below. Additionally, we have experimental crosslinking data for this protein obtained from IMP.

Human Neutrophil Gelatinase-Associated Lipocalin (HNGAL) (PDB ID:LCN2) - Hydrolase/Hydrolase Inhibitor (PDB ID: MMP9)

MMP9 has 2 chains and LCN2 has 1 chain. We compare the results obtained from IMPROViSeD with the structures for the complex downloaded from PDB.

Experimental crosslink data are not available for this case, so we generated synthetic crosslink data using the following algorithm.

Algorithm for synthetic crosslink generation

Running the pipeline

Number of crosslinks:

1DFJ: 12 (experimental)
LCN2-MMP9: 11 (synthetically generated)

All vs Subset of Crosslinks

We ran our pipeline on randomly chosen subsets of crosslinks, since we are solving a localization problem to form the supporting framework for the two bodies. Because this problem is inherently non-convex, its solution is not unique and depends on the random seed. We use this to our advantage: starting from different random seeds generates multiple structures for the supporting framework. The execution time is less than a minute.
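The subset selection described above can be sketched as follows. This is a minimal illustration, not the IMPROViSeD implementation: the function name `sample_crosslink_subsets`, the parameters `subset_size`, `n_runs` and `base_seed`, and the example crosslink list are all hypothetical.

```python
import random


def sample_crosslink_subsets(crosslinks, subset_size, n_runs, base_seed=0):
    """Return one random subset of crosslinks per run (hypothetical helper).

    Each run uses its own seed so that every subset is reproducible.
    """
    if subset_size > len(crosslinks):
        raise ValueError("subset_size exceeds the number of crosslinks")
    subsets = []
    for run in range(n_runs):
        rng = random.Random(base_seed + run)  # independent generator per run
        subsets.append(rng.sample(crosslinks, subset_size))  # no repeats
    return subsets


# Placeholder crosslinks: (residue in chain A, residue in chain B).
# These pairs are made up for illustration only.
crosslinks = [(i, 100 + i) for i in range(12)]
subsets = sample_crosslink_subsets(crosslinks, subset_size=8, n_runs=5)
```

Seeding each run separately keeps the runs reproducible while still giving a different subset, and hence a different starting point for the localization problem, in each run.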

We also tried using all crosslinks, but this resulted in more clashes. The reason is that the crosslinking distance is not an absolute value: it depends on the flexibility of the sidechains and the backbone of the protein, so the nominal distance of 30 Angstroms varies. Additionally, the presence of alternately organised complexes cannot be ruled out. Moreover, the crosslinks denote the distance between the residues, while IMPROViSeD uses the distance between the C\(^\alpha\) atoms. We therefore add a tolerance of 5 Angstroms to the crosslink distance when evaluating violations.

Evaluation

The structures obtained by running IMPROViSeD are evaluated on:

Backbone Root Mean Square Deviation (RMSD) with respect to the known structure in the PDB repository: this indicates how similar or distinct the structures modelled by IMPROViSeD are.
Number of crosslink violations: this assesses the accuracy of the structure modelling with respect to the experimental data.
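For reference, a backbone RMSD can be computed along the following lines. This sketch assumes the two structures are given as matched (N, 3) arrays of backbone atom coordinates and that they are superposed with the Kabsch algorithm before the deviation is measured; whether IMPROViSeD's evaluation superposes the structures in this way is an assumption, and the function name `kabsch_rmsd` is hypothetical.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between matched (N, 3) coordinate arrays after optimally
    superposing P onto Q with a rigid rotation and translation."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")
    # Remove the translation by centring both point sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = Pc @ R.T  # apply the rotation to every point of P
    return float(np.sqrt(np.mean(np.sum((P_rot - Qc) ** 2, axis=1))))


# Illustrative check on made-up coordinates: a rotated and translated
# copy of a point set should give an RMSD of (numerically) zero.
rng = np.random.default_rng(0)
P_demo = rng.normal(size=(10, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta), np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
Q_demo = P_demo @ Rz.T + np.array([1.0, 2.0, 3.0])
rmsd_demo = kabsch_rmsd(P_demo, Q_demo)
```

The reflection guard matters for protein structures, because a mirror image would otherwise be reported as a perfect match.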

We get different solutions to the localization problem for different runs, which results in different structures following the registration step. We evaluated these structures based on the percentage of satisfied crosslinks. We also report the backbone RMSD for each case with respect to the structure found in PDB. We consider a crosslink to be satisfied if its distance is less than 30 + 5 = 35 Angstroms, as per the criterion defined above. In our case, the crosslink distance is measured between the C\(^\alpha\) atoms of the specified residues on two different chains.
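The satisfaction criterion above can be written as a short check. This is a sketch under stated assumptions: the C\(^\alpha\) coordinates are supplied as dictionaries keyed by residue number, and the function name `crosslink_satisfaction` and the example coordinates are hypothetical. Only the 30 Angstrom distance and the 5 Angstrom tolerance come from the text.

```python
import math

CROSSLINK_DISTANCE = 30.0  # Angstroms, crosslinking distance used in the text
TOLERANCE = 5.0            # Angstroms, tolerance added when evaluating


def crosslink_satisfaction(ca_a, ca_b, crosslinks,
                           cutoff=CROSSLINK_DISTANCE, tol=TOLERANCE):
    """Return (percentage satisfied, number of violations).

    ca_a, ca_b: dicts mapping residue number -> (x, y, z) of the C-alpha atom
    crosslinks: list of (residue in chain A, residue in chain B) pairs
    """
    if not crosslinks:
        raise ValueError("at least one crosslink is required")
    # A crosslink is satisfied if its C-alpha distance is below cutoff + tol.
    satisfied = [math.dist(ca_a[i], ca_b[j]) < cutoff + tol
                 for i, j in crosslinks]
    n_violations = satisfied.count(False)
    pct_satisfied = 100.0 * satisfied.count(True) / len(satisfied)
    return pct_satisfied, n_violations


# Made-up coordinates: the two crosslinks span 20 and 40 Angstroms.
ca_a = {10: (0.0, 0.0, 0.0), 25: (0.0, 0.0, 0.0)}
ca_b = {7: (20.0, 0.0, 0.0), 40: (40.0, 0.0, 0.0)}
pct, violations = crosslink_satisfaction(ca_a, ca_b, [(10, 7), (25, 40)])
# pct == 50.0, violations == 1
```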

Software Experiments