Model

Introduction

Gaining an average spatial perspective of the experimental process and comprehending the interactions between various reaction components are crucial steps in the design of any experiment. Given the extensive nature of our project with its three main branches, we tried to address some of the key aspects that puzzled us. The first and most strenuous effort was understanding the interactions between the protein domains and tweaking the chimeric autoantibody receptor structure. The second effort was designing and modelling the single-stranded and double stranded molecules of the RNA release mechanism, along with all the strand displacement reactions and the hybridisation chain reaction.

In Silico Chimeric Autoantibody Receptor Μodeling

We based our initial design on studies by Castella et al. (2019), Rafiq et al. (2020), Ellebrecht et al. (2020), and Chang & Chen (2017). These studies seemed to converge on some features, like the importance of a CD8a transmembrane domain (BBa_K5310008), or the costimulatory domains 4-1-BB (BBa_K5310010) and CD3z (BBa_K5310011).

Without the aid of software, it can take years to predict the structure of one single protein. Consequently, we sought to use the best computational tools we could find.

PDB

We cross-referenced our sequences from NCBI with the Protein Data Bank (PDB), a publicly accessible database that stores and shares three-dimensional biomolecular structure data including density, domains, atoms of interest, distances and oligomer formation. We found our transmembrane (see Parts: CD8a, CD4 and ICOS transmembrane, CD8a hinge) and cytoplasmic domains (CD28, CD137, CD3z, OX40, ICOS, CD27, MYD88-CD40 and KIR2DS2 signaling).

Visualisation: PyMOL

Our initial steps while creating a fusion protein like our CAAR were cautious. We met with a structural biologist, Mrs. Fadouloglou, who immediately suggested PyMOL (Molecular Graphics System, Version 3.0 Schrödinger, LLC.), a graphical tool to visualize these structures and assess their folding patterns (see Human Practices). We initially used it to test our protein linkers and hinge domains to ensure receptor flexibility and proper positioning inside and outside the cell. Then, we also attempted to evaluate the possibility of our CAAR creating dimers, to no avail. Sadly, our educational-use PyMOL build does not allow the visual reproduction, so we have employed other programs for the visualizations for the wiki.

AlphaFoldV2

AlphaFold2 (AF2) is a high-accuracy artificial intelligence system for tertiary structure prediction from an amino acid sequence input. It has been widely used in computational biology and allowed us to test potential structures for our CAAR domains. When testing for MBP-CAAR-v2, five models (PDB files and two sets of data) were generated by the program to assess the accuracy of the models that were predicted. The "predicted local distance difference test" (pLDDT), which had average scores of 79 and 77 for the top two models, was one of these credibility scores. Protein main chain predictions with values between 70 and 90 are considered to be highly accurate. Every residue of MBP-CAAR-v2 for each of the five models' pLDDT values was also shown on a graph, where a score between 0 and 100 was plotted against the location of the corresponding amino acid.

Sequence coverage from PDB for the MBP-CAAR-v2
Figure: Sequence coverage from PDB for the MBP-CAAR-v2.
Figure: Scores for models rated 1 through 5 based on prediction aligned error (PAE). For every pair of residues, this score shows the computed error of the projected distance. The positions of the various amino acids are shown on both axes. As seen in the right bar, the uncertainty in the expected distance between two amino acids is color coded from blue (0 angstrom) to red (30 angstrom). The error in the estimated distance between these two residues is indicated by the color of the intersection of a horizontal line drawn from one amino acid's position on the y-axis and a vertical line drawn from another amino acid's position on the x-axis. A diagonal blue line characterizes PAE graphs in all cases because amino acids that are adjacent in the main sequence are likewise adjacent in the three-dimensional structure. The error in the distances inside and between the domains remains constant for all models. Keep in mind that since all of the models can be easily layered, the differences between them are quite slight.
Figure: Predicted scores for each point on the local distance difference test (LDDT) for each of the five models produced by alphafold2. Plotting the amino acid location versus the anticipated LDDT has been performed. Protein main chain predictions with values between 70 and 90 are considered to be highly accurate. pLDDT values greater than 90 denote very high precision, matching experimentally determined structures. A lesser accuracy is indicated by values between 50 and 70, although it is still likely that the predictions of individual secondary structures are accurate

Predicted alignment error (PAE) provides a more accurate examination of the prediction's credibility for the distances between domains or chains than the pLDDT score, which is generally better suited to show the prediction's credibility for individual domains.

RoseTTAFold

After finding the domain sequences, determining their positioning and adding signal peptides and linkers, we were ready to test our protein in silico. We chose to start from Robetta, a platform that offers rapid and accurate deep-learning algorithms, similar to RoseTTAFold 1.

All the below model plots have trends and fluctuations.

Trends:

  • There are noticeable peaks above 20 angstroms in the plots (Figures in moving bar). These mostly correspond to areas in the protein where the structural model is less reliable. This could be due to flexibility (huge peak at the beginning of the linker region, aa no.56), disordered regions (weakly designed transitions from one domain to another), or insufficient experimental data. In future endeavors, we could experiment extensively to test increased stability configurations for the transitioning regions. We have implemented a change in the linker for that reason (see Engineering success).
  • In contrast, valleys in the plot (below 10 angstroms) suggest parts of the protein where the structure is more accurately determined, possibly due to well-defined secondary structures like helices or sheets that are easier to model.

Fluctuations:

  • The error estimate fluctuates throughout the sequence, indicating that different regions of the protein have varying degrees of modeling confidence. Our understanding is that large fluctuations might correspond to transitions between structured and unstructured regions, e.g., the linker-to-CD8a transition, or between core and surface-exposed residues. Surface-exposed residues exist in the autoantibody binding domain as well as in the three signaling domains.

MBP-CAAR-v2

MBP-CAAR-v1

MOG-CAAR-v1

Image Set 1 Image Set 1

Plot 1 has the highest errors at linker-hinge and transmembrane-signaling transitions.

Image Set 2 Image Set 2

Plot 2has higher confidence for the intracellular domains and high errors at the linker-hinge and hinge-transmembrane

Image Set 3 Image Set 3

Plot 3 has greater error estimates for the intracellular domain configurations, but better estimation of the linker-to-CD8a hinge transition.

Image Set 4 Image Set 4

Plot 4 is more consistent with intracellular domain configuration, with normal errors at the transitions (insufficient experimental data available to the algorithm).

Image Set 5 Image Set 5

Plot 5: Highest overall error configuration, with entangled intracellular domains

Image Set 1 Image Set 1
Image Set 2 Image Set 2
Image Set 3 Image Set 3
Image Set 4 Image Set 4
Image Set 5 Image Set 5
Image Set 1 Image Set 1
Image Set 2 Image Set 2
Image Set 3 Image Set 3
Image Set 4 Image Set 4
Image Set 5 Image Set 5

SWISS-MODEL

The next computational tool we used was SWISS-MODEL for protein homology modeling with minimal input requirements. From it we retrieved the Ramachandran plots, MolProbity structure validation, Quality Estimates, but it failed to assess membrane prediction. It is possible that the membrane prediction failure can be attributed to the chimeric nature of the protein, with not-so-clear transmembrane segments (the CD8a transmembrane is a single helix). Another hypothesis is that more than one CAAR domain gets confused as putative transmembrane domains due to the uncharted structure of the protein.

Ramachandran Plot

Theoretical representation of a Ramachandran plot
Image: Theoretical representation of a Ramachandran plot

Α Ramachandran plot is used to analyze the backbone dihedral angles (Φ and Ψ) of amino acid residues in protein structures.

The plot has green contour regions which represent favored or allowed conformations of amino acid residues. These regions correspond to specific secondary structures. The top left region (around Φ -60o and Ψ 150o) is associated with beta sheets. The bottom left region (around Φ -60o and Ψ -60o) corresponds to alpha helices.

The dots represent individual residues in the protein structure. A significant number of the residues fall within the allowed regions (in the green areas), indicating that most of the protein residues adopt common, stable secondary structures in both MBP-CAAR-v1 and -v2.

In version 2 of the MBP-CAAR construct there are very few outliers, suggesting that the overall structure of the protein is well-folded and stable. The outliers correspond to residues in loops.

Image: Structure assessment of the MBP-CAAR version 1 3.
Image: Structure assessment of the MBP-CAAR-version 2 2.

MolProbity

MolProbity version 4.4 helped us evaluate model quality at both the global and local levels for both proteins and nucleic acids.

Score Comment Goal Result in MBP-CAAR version 1 Result in MBP-CAAR version 2
MolProbity Score Combined protein quality score that reflects the crystallographic resolution at which such a quality would be expected As low as possible 3.92 3.17
Clash Score van der Waals sphere overlap > 0.4Å As low as possible 39.78 174.09
Ramachandran Favoured Backbone dihedrals in area occupied by ≥98% of Ramachandran statistics ≥ 98% 41.90% 92.38%
Ramachandran Outliers Backbone dihedrals in area occupied by ≤0.05% of Ramachandran statistics ≤ 0.05% 38.73% 2.22%
Rotamer Outliers Unlikely conformations given rotamer statistics ≤ 0.3% 12.40% 0
C-Beta Deviations Position deviates from ideal by > 0.25Å Zero 0 0
Bad Bonds > 4σ deviations from ideal < 0.01% 145 / 2526 0/2527
Bad Angles > 4σ deviations from ideal < 0.1% 320 / 3412 7/3414

CRISPR guide RNA Modeling

To design appropriate guide RNAs (gRNAs) for CRISPR-Cas9 technology that would be used to induce FOXP3 expression in the K562 cell line, we use the CRISPick software. The first step was selecting a reference genome from the available options, along with a mechanism and enzyme. We then entered the target gene and chose a CRISPick quota (we picked the default which is five results per target).

After validation, the target resolution phase matches input genes to known entities, retrieving sequence data. Then, CRISPick scans for PAM sites and generates sgRNA candidates. These are then ranked based on on-target and off-target criteria to finalize the guide selection. We repeated the process several times and sorted through the results to make sure our sgRNA picks did not have sequence overlap.

crispick-interface
Image: CRISPick Interface.
crispick-analysis-results
Image: CRISPick Analysis Results.

microRNA Theoretical Models

Toehold-Mediated Strand Displacement

A single strand known as invader (X) and a double strand complex are the reactants of toehold mediated strand displacement. The two strands that make up the complex are the incumbent (Y) strand and the substrate (S), which has the toehold, a single-stranded overhang.

strand-displacement
Figure: Toehold-Mediated Strand Displacement.

The invader strand is entirely complementary, according to Watson-Crick rules, to the substrate, and these two strands can attach to each other utilizing the toehold domain. The dissociation of the toehold makes this process reversible. At the branch migration stage of the reaction, the invader, incumbent, and substrate are arranged in a three-stranded complex that is oscillating. When the incumbent is released, the response is finished. Overall, the net gain in base pairs is what thermodynamically propels the displacement forward.

microRNA mechanism constraints

The T structure is comprised of three strands, T1, T2 and T3, functioning as follows:

  1. The T1a strand is complementary to microRNA-125a-3p and has a few unpaired bases at the 3’ end that can readily bind to the 5’ end of the microRNA.
  2. T2 is designed complementary to miR-146a-5p.
  3. T3 is the sequence that results from the complementarity to T1 and T2, with an additional 2 unpaired bases that resulted from trial and error.

In abundance of microRNA-125a-3p, first T1a, then T1b thermodynamically unbind and the whole T1 strand is released, leaving unpaired bases on the other two strands. If microRNA-146a-5p also exists in abundance, it binds and releases T2, its complementary strand, and T3, the initiator for the hybridization process.The HCR set consists of Hairpin H1, H2, H3, and H4. The hairpins are designed with specific domains and their complementary ones. Domain a and b are based on the initiator sequence, while the rest are random and follow hairpin design considerations listed at the end of the Design page. The initiator is complementary to the 5' end of the hairpin 1. When the initiator hybridizes with H1, H1 unfolds and binds to the H2, following the rules of toehold-mediated strand displacement which unfolds and in sequence causes H3, then H4, then H1 again to undergo a hybridisation reaction, thus forming a chain (HCR). With each cycle involving H1 through H4, the two therapeutic miRNAs, miR-219a-5p and miR-338-3p that were previously hybridized in vitro onto the HCR hairpins, are thermodynamically displaced and released.

microrna-mechanism-new
Figure: General overview of T structure function, initiator release and HCR triggering.

Free Energy Gibbs

An RNA hairpin's free energy plays a major role in determining its stability. Let's define this thermodynamic quantity first before looking at its relationship with hairpin formation.

Gibbs energy, another name for free energy, is a unit of measurement used to assess whether a process or reaction will require additional energy or occur spontaneously. It is defined by the equation:

G = H – TS

where G is Gibbs energy, H is the enthalpy of the system, T is the absolute temperature (in Kelvin), and S is the entropy 4.

The indicator of whether the process is spontaneous is the change in Gibbs energy. The process is thermodynamically advantageous (spontaneous) when ΔG is negative. Conversely, a positive value necessitates the input of external energy for the process to be executed 5.

When it comes to biological systems, free energy plays a key role in forecasting the stability of intricate molecular structures. A shift in free energy occurs when an RNA molecule adopts a hairpin structure. A hairpin is more likely to form and remain intact if its Gibbs free energy (ΔG) changes negatively, indicating that the hairpin is more stable than its unfolded form 6.

However, predicting the free energy can be a difficult task when designing RNA hairpins. To tackle this complexity, several computer models have been created to precisely determine the free energy related to RNA folding. With the consideration of variables like base pairing, loop size, and sequence context, these models utilize sophisticated algorithms and empirical data to forecast the thermodynamic stability of the hairpins. The ones we used were NuPACK and RNAFold (see below).

We made many attempts (Fig.3) to design hairpins that are structurally and functionally sound before landing on our final structures, which can be found in our Parts Page.

crispick-analysis-results
Figure 3. Schematic representation of one of our Y prototypes. Unpaired bases on both Y1 and Y3. Greater thermodynamic instability of the Y structure led to the more structurally sound T design.

microRNA specificity mechanism (T-structure)

Our own python software

Note: For our Y design, we tried to build upon our 2022 project “Theriac” Y design python script. Upon attempt, it appeared to be malfunctioning; after troubleshooting it, we uploaded the new, up-to-date notebook on the team Gitlab.

The notebook is written in Python 3.11 and uses the libraries BioPython (biological sequence manipulation), Pandas (manipulating data- exporting CSV), NumPy (for numerical computations and NuPACK. NuPACK is a package designed for analyzing and designing nucleic acid structures, which we used here for the thermodynamics evaluation.

The notebook assists the design of the T1, T2, and T3 sequences that form the T-structure outlined in the Design page. It iteratively generates different T1, T2, and T3 sequences, evaluates them using a Gibbs free energy function (`gibbs(Y1, Y2, Y3)`), and records the results. The goal is to find the best combination with the lowest Gibbs energy, ensuring stability and specificity of the T3 (the HCR initiator) release only in the presence of both microRNA1 and microRNA2; in our case microRNA-125a-3p and microRNA-146a-5p, respectively.

The notebook interacts with the user by asking for the sequences of the first and second microRNAs involved in the release process. This enables the user to design an infinite amount of different specificity structures depending on the microRNA combination they wish to implement.

Example of an attempt to input microRNA sequences
Figure: Example of an attempt to input microRNA sequences.

The Gibbs free energy values are stored in a DataFrame and exported by the notebook to a CSV file, which helps in identifying the most optimal sequence for the initiator strand (Y3 or, in our case, T3).

screencapture-y-structure-output

Model design notes:

  • The nucleotide sequences are generated randomly in each iteration, and the stability is evaluated to find the optimal structure.
  • The inputs are whichever two microRNAs the user selects through data analysis for the specific release of T3 sequence.
  • The primary criterion for selecting the best sequences is the Gibbs free energy, with lower values indicating more stable structures.
  • The notebook iterates multiple times to explore different possibilities for the T1, T2, and T3 sequences, eventually selecting the sequence that yields the lowest Gibbs energy (most stable configuration).

How did this model help us?

y1-3-sequence-output

The model offered the suggested strands for the structure, along with their free energy Gibbs, which we subsequently ordered after visualising on the NuPACK online module and confirming its thermodynamics.

Nupack

pair-probability

The diagram above outlines the probability that base x from strand A is paired with base y from strand B at equilibrium.

structure-entropy

Hairpin modeling (HCR)

To design and assess our HCR hairpins, we analyzed our sequences in two different nucleic acid structure prediction tools to cross reference results, refine search parameters and ensure accuracy.

RNAfold Hairpin output

For the design, calculation of entropy and minimum free energy (MFE) secondary structure of our HCR hairpins, we used RNAfold, a software tool that processes sequence input according to certain parameters such as free energy, unpaired nucleotides and thermodynamic state variables (temperature and volume) to export a predicted molecular structure. RNAfold uses the dynamic programming algorithm first proposed by Zuker and Stiegler to predict the MFE secondary structure of single RNA sequences. This algorithm methodically assesses every potential base-pair interaction to effectively determine the most thermodynamically advantageous configuration of an RNA molecule.

References

  1. Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N. Kinch, R. Dustin Schaeffer, Claudia Millán, Hahnbeom Park, Carson Adams, Caleb R. Glassman, Andy DeGiovanni, Jose H. Pereira, Andria V. Rodrigues, Alberdina A. van Dijk, Ana C. Ebrecht, Diederik J. Opperman, Theo Sagmeister, Christoph Buhlheller, Tea PavkovKeller, Manoj K Rathinaswamy, Udit Dalwadi, Calvin K Yip, John E Burke, K. Christopher Garcia, Nick V. Grishin, Paul D. Adams, Randy J. Read, David Baker. Accurate prediction of protein structures and interactions using a 3-track network. Science 10.1126/science.abj8754 (2021); doi: https://doi.org/10.1126/science.abj8754.
  2. Structure Assessment. Waterhouse AM, Studer G, Robin X, Bienert S, Tauriello G, Schwede T. The structure assessment web server: for proteins, complexes and more. Nucleic Acids Res 52, W318-W323. (2024)
  3. Structure Assessment. Waterhouse AM, Studer G, Robin X, Bienert S, Tauriello G, Schwede T. The structure assessment web server: for proteins, complexes and more. Nucleic Acids Res 52, W318-W323. (2024)
  4. Doshi, K.J., Cannone, J.J., Cobaugh, C.W. et al. Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics 5, 105 (2004). https://doi.org/10.1186/1471-2105-5-105
  5. DALE T, SMITH R, SERRA MJ. A test of the model to predict unusually stable RNA hairpin loop stability. RNA. 2000;6(4):608-615. doi:10.1017/S1355838200992495
  6. Zhou, R., Zeng, Z., Sun, R., Liu, W., Zhu, Q., Zhang, X., & Chen, C. (2021). Traditional and new applications of the HCR in biosensing and biomedicine. Analyst, 146(21), 7087–7103. https://doi.org/10.1039/d1an01371h