Software


OpenBind: Accessible Binder Generation and Evaluation For All


Machine Learning has exploded across academia, largely due to the prevalence of generative models such as ChatGPT and DALL-E. In protein engineering, AlphaFold in particular has accelerated protein structure prediction and the sequence mapping of structural motifs. Despite all of this progress, there is still no accessible approach for leveraging machine learning models to generate proteins with binding in mind. Specifically, we identified peptides as a novel avenue at the intersection of modelability and specificity for therapeutic applications. As part of this development pipeline, we have documented and curated a streamlined workflow that can empower even those with no computational experience to run full protein design pipelines. This year, ASU iGEM developed a fully training-free, computationally cheap workflow for the generation and design of novel protein structures, built especially with synthetic biologists in mind.


Chapter 1: ML Architecture and Formulation of Key Components


To develop a workflow that is both easy to digest and built on well-regarded tools, we decided to use previously published models. This was a difficult decision, but part of using a tool is understanding it. While it is always interesting to propose and create novel machine learning models, that is not the focus of this project; we had to weigh the feasibility of using these tools alongside their accuracy.


To this end, we primarily use RFDiffusion, ProteinMPNN, and AlphaFold2 to cover protein structure prediction and generation, as well as sequence generation and sequence fixing. The following section gives a high-level overview of each model, along with a concrete example of its inputs and outputs.

f1
Figure 1: General Workflow Overview
Key Vocabulary:


To learn more about the different architectural details, click on the different colored headers!


Introduction: RFDiffusion, as a machine learning approach to de novo protein design, is a major player in the p(structure | structure) space. It is unique in this respect, as most other diffusion-based architectures are either unguided (Proteus) or pLM-based (ptm-Mamba). RFDiffusion thus allows both the generation of complementary protein structures and conditioning on structural constraints. We chose not to use pLM-based models because of how little validation has occurred in that space. We fully believe there are flaws in the current structure-based generation approaches, but accept that there is little we can do without extensive computational power and the money to fund such efforts.


f2
Figure 2: Visualization of RFDiffusion

Architecture: RFDiffusion is an SE(3)-equivariant diffusion model trained to remove Gaussian noise from the residue coordinates of a structure. In the forward process, subject to the physical constraints of bonds, RFDiffusion injects Gaussian noise into the residue positions of a given protein until the structure approaches pure Gaussian noise. In the reverse process, RFDiffusion attempts to reconstruct the input structure, and in doing so learns weights for solving structures from a random distribution of residues in 3D space. For a more in-depth explanation, please see their repository here or their paper here.
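To make the noising process concrete, below is a minimal sketch of DDPM-style Gaussian corruption applied to a toy coordinate array. This is an illustrative simplification, not RFDiffusion's actual implementation (which diffuses rigid-body residue frames and handles rotations separately); the schedule, shapes, and variable names here are our own assumptions.

```python
# Toy sketch of the forward diffusion step: corrupt "CA coordinates"
# toward pure Gaussian noise using the standard DDPM closed form.
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x0, t, betas):
    """Return x_t: x0 corrupted with Gaussian noise up to step t."""
    alpha_bar = np.prod(1.0 - betas[: t + 1])      # cumulative signal kept
    eps = rng.standard_normal(x0.shape)            # pure Gaussian noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

betas = np.linspace(1e-4, 0.02, 200)               # toy noise schedule
coords = rng.standard_normal((50, 3))              # 50 residue positions (toy)
x_T = forward_noise(coords, t=199, betas=betas)    # ~indistinguishable from noise
```

A trained denoiser learns to run this process in reverse, stepping from x_T back toward a physically plausible backbone.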


Role: We use this conditional generation model to produce complementary structures that target key hotspots on our protein of interest. In this context, it is important to note that RFDiffusion makes no decisions about the sequence or side chains of the diffused binder; the output of this process is a binder with backbone coordinates, represented wholly as a glycine chain. This is not a physically meaningful sequence, which is why RFDiffusion should be used only for backbone design.


At-a-glance: input a target protein structure file (.pdb) and hotspot residues; output a structure file (.pdb) containing the original target and a complementary backbone (the binder).

Introduction: ProteinMPNN moves away from solving structures themselves, focusing instead on "fixing" and redesigning sequences. That is, given an input structure, along with the residues to be redesigned as a prompt, ProteinMPNN makes substitutions within the structure and produces a new sequence. For our workflow, this step is essential because our prospective binder has no side chains at all after the RFDiffusion step, and ProteinMPNN gives us a more salient representation. For more information on this work, please refer here.


Architecture: ProteinMPNN is a graph neural network that has learned the distances between interacting residues and computes the most probable residue at each position. It does this by encoding sequence and structure separately, then calculating the probability of each residue identity at each position in a protein. Finally, it outputs a protein sequence, given the heavy-atom distances between the two interfaces.
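The final step of that description, picking residues from per-position probabilities, can be sketched in a few lines. The logits below are random placeholders rather than real network outputs, and the temperature value is just a common choice for conservative sampling.

```python
# Toy sketch: sample a sequence from per-position residue probabilities
# via a temperature-scaled softmax, as in ProteinMPNN's decoding step.
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(0)

def sample_sequence(logits, temperature=0.1):
    """Sample one residue per position; lower temperature = greedier picks."""
    z = logits / temperature
    z -= z.max(axis=-1, keepdims=True)              # numerical stability
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    picks = [rng.choice(len(AMINO_ACIDS), p=p) for p in probs]
    return "".join(AMINO_ACIDS[i] for i in picks)

logits = rng.standard_normal((30, 20))   # 30 designable positions x 20 AAs
print(sample_sequence(logits))           # one candidate 30-mer sequence
```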

f3
Figure 3: Visualization of ProteinMPNN Architecture

Role: As a structure-conditioned sequence design model, ProteinMPNN bridges two different approaches: it uses structure to create viable sequences, and the protein sequence it produces can be used in downstream synthesis and prediction as well. While positional information is important, we consider sequence information more significant in terms of biophysical interactions.


At-a-glance: input a structure file (.pdb) containing the target and backbone; prompt the model with the residues to bind to (these should be the same hotspot residues) and the chains to redesign (the backbone). Outputs a sequence along with ProteinMPNN's evaluation loss.

Introduction: AlphaFold2 is a widely acclaimed and widely used protein structure prediction tool. As a continuation of the AlphaFold series, its main benefit for our purposes lies in its ability to act as a feedback step for sequence design. While ProteinMPNN is provably effective at fixing the sequences of backbones given structure, it is important not to overfit our results to tools developed in the RoseTTAFold space. RFDiffusion and ProteinMPNN come from the same lab, which means their approach to protein design is more uniform and could overlook beneficial aspects of other approaches.


Architecture: AlphaFold2 is the second generation of Google DeepMind's structure prediction tools and represents an interesting paradigm shift in structural prediction. It employs a few key architectural improvements, namely a triangle attention mechanism and MSA embeddings. The MSA embeddings use multiple sequence alignment information to capture evolutionary signal: the identity and similarity of related protein sequences are computed and used as an input to its transformer. The transformer also applies triangular updates to pairs of residues to discourage the creation of geometrically impossible structures. The mechanism is called triangle attention because it computes attention weights over triples of residues while encouraging consistency with the triangle inequality, i.e.

for every triangle of residues with pairwise distances {a, b, c}, the distances must satisfy a + b ≥ c under any permutation of a, b, and c.
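A quick way to see what this constraint means in practice: any set of distances measured between real 3-D coordinates automatically satisfies it, while an arbitrary matrix generally does not. Below is a small self-contained check (the function name and tolerance are our own, for illustration).

```python
# Sketch: test whether a pairwise-distance matrix D obeys the triangle
# inequality d(i,k) <= d(i,j) + d(j,k) for every triple (i, j, k).
import numpy as np

def violates_triangle_inequality(D, tol=1e-8):
    """True if some triple violates d_ik <= d_ij + d_jk."""
    # through[i, k] = min over j of d_ij + d_jk
    through = (D[:, :, None] + D[None, :, :]).min(axis=1)
    return bool((D > through + tol).any())

coords = np.random.default_rng(0).standard_normal((20, 3))
D = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
print(violates_triangle_inequality(D))   # False: real 3-D points always obey it
```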

Finally, given the output representation of the triangle attention block, AlphaFold2 uses a deterministic folding module to solve the structure from these outputs; deterministic in the sense that the same inputs to this module always produce the same outputs.

f4
Figure 4: Visualization of AlphaFold2 Architecture

Role: As mentioned earlier, we would like to avoid overfitting to the representational heuristics of the RoseTTAFold suite of methods. As such, we use AlphaFold2 as a feedback step in designing protein sequences. For each sequence generated by ProteinMPNN, we input the target sequence and the generated sequence into AF2 to predict the structure. We then calculate the RMSD between the AF2 structure and the original RFDiffusion-generated structure as a measure of feasibility and accuracy for the ProteinMPNN sequences.
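The RMSD comparison itself is straightforward to reproduce. Below is a minimal sketch of Kabsch-superposed RMSD between two CA coordinate sets; the coordinates are placeholders, and in practice the arrays would be parsed from the AF2 and RFDiffusion .pdb files.

```python
# Sketch: RMSD between two (N, 3) coordinate sets after optimal
# superposition (Kabsch algorithm).
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD of P vs Q after centering and optimally rotating P onto Q."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)         # decompose the covariance
    d = np.sign(np.linalg.det(U @ Vt))        # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt       # optimal rotation
    return float(np.sqrt(((P @ R - Q) ** 2).sum() / len(P)))

rng = np.random.default_rng(0)
P = rng.standard_normal((100, 3))                      # e.g. RFDiffusion CAs
Q = P @ np.array([[0., -1, 0], [1, 0, 0], [0, 0, 1]])  # e.g. AF2 CAs (rotated copy)
print(kabsch_rmsd(P, Q))                               # ~0.0 for a pure rotation
```

A low RMSD indicates that AF2, given only the designed sequence, independently recovers the structure RFDiffusion intended, which is the self-consistency signal we use to rank sequences.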


At-a-glance: Input sequences from ProteinMPNN. Output structural prediction and stability metrics.

Introduction: Over the course of this project, we realized that it wouldn't be enough to rely on the metrics returned by AlphaFold2. In fact, AlphaFold2 is provably flawed at predicting structure from sequence, as covered in this section of the engineering page. Although we didn't come to this realization until later, we decided early on to use binding prediction models as a final metric of success. The rationale for this approach is that protein function is not directly correlated with structural stability; in fact, some of the most functional binding regions of antibodies are structurally unstable, and their instability is what grants them their binding ability. However, in the space of peptide binding prediction or PPI prediction, very few pre-trained models existed with our purpose in mind. To address this, we turned to general ligand binding models as well as protein pose prediction models, which led us to choose DynamicBind and GAABind.


f5
Figure 5: Visualization of DynamicBind Architecture

Architecture: DynamicBind is an interesting combination of previously described architectures: it uses a diffusion-based, SO(3)-equivariant graph neural network to learn protein conformational states. The model first takes an abstracted view of the input protein and ligand, with no coordinate information about their interaction. Then, over diffusion time steps, the model applies conformational changes and roto-translational updates to both structures until a stable state is reached, or until the model believes it has reached a separate conformation. In its original paper (here), DynamicBind reports more accurate structure prediction than AlphaFold2 for the conformational states of bound proteins. Furthermore, DynamicBind outputs an interpretable metric of binding affinity, -log(Kd), which can be experimentally validated in downstream analyses.
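Since both evaluation models report affinity as -log(Kd), it helps to see how that number maps back to familiar units. The sketch below assumes Kd is expressed in molar units and uses the standard ΔG = RT·ln(Kd) relation; the function name and example value are ours.

```python
# Sketch: convert a predicted -log10(Kd) into Kd and an approximate
# binding free energy (assumes Kd in molar units, T = 25 degrees C).
import math

def affinity_from_neg_log_kd(neg_log_kd, temp_k=298.15):
    kd = 10.0 ** (-neg_log_kd)          # dissociation constant, molar
    R = 1.987e-3                        # gas constant, kcal/(mol*K)
    dG = R * temp_k * math.log(kd)      # binding free energy, kcal/mol
    return kd, dG

kd, dG = affinity_from_neg_log_kd(7.0)  # a nanomolar-range prediction
print(f"Kd = {kd:.1e} M, dG = {dG:.1f} kcal/mol")   # 1.0e-07 M, about -9.5
```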


f6
Figure 6: Visualization of GAABind Architecture

GAABind was our second choice for binding affinity evaluation of designed structures because of its ability to take binding residues as part of the input; in this sense, GAABind is more of a supervised approach to binding affinity approximation, while DynamicBind is explicitly unsupervised. This gives us a more complete view of the different models being used and, if we do end up performing experimental validation, a way to measure which model is closer to the ground truth. Like DynamicBind, GAABind produces both a structural prediction and a binding affinity prediction (-log Kd). Architecture-wise, however, GAABind takes a more deterministic approach to binding affinity: given the binding pocket of the target, GAABind propagates attention-based updates between the ligand graph and the binding pocket, not unlike the way ProteinMPNN generates its updates. In comparison, though, GAABind's approach changes the location and pose of the complete ligand graph, whereas ProteinMPNN calculates probability densities for each residue. GAABind then uses the updated view of both graphs to generate a final structural prediction, along with a binding affinity prediction. For a more complete explanation, check out the original publication (here).


Role: Both DynamicBind and GAABind serve in this workflow as an alternative measure of confidence. While AlphaFold2 can be used for structural prediction, it is important to also consider methods directly applicable to predicting binding affinities and binding poses. Generated structures are therefore screened through GAABind first and DynamicBind second, largely because GAABind is far more computationally expensive than RFDiffusion, and DynamicBind far more expensive still than GAABind.


At-a-glance: we use one supervised and one unsupervised binding affinity prediction approach to gain greater perspective on the results generated by the aforementioned models. In the same vein as using AF2 to diversify evaluation, we use GAABind and DynamicBind for a direct metric of binding affinity.



Chapter 2: User Guide


In this section, we provide a detailed guide to using the software and the recommended workflow for designing binding moieties.



Introduction: Colab, or "Colaboratory", is a web-based Python environment developed by Google. It is a server-hosted instance of Jupyter that allows Google users to write and execute code without needing to download or set up any programs locally, which makes it best suited for portable implementations and tinkering. To familiarize yourself with the Colab setup and the way the environment works, please check out the provided tutorial here.


Step 1: Opening the code into a Colab environment



Step 2: Downloading and defining an accessible binding interface using the RCSB PDB
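For illustration, fetching a structure from the RCSB PDB takes a couple of lines; the PDB ID below (4HHB, hemoglobin) is just an example, not a target from our project.

```python
# Sketch: download a target structure file from the RCSB PDB.
import urllib.request

pdb_id = "4HHB"                                    # example ID; swap in your target
url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
urllib.request.urlretrieve(url, f"{pdb_id}.pdb")   # saves 4HHB.pdb locally
print(f"Saved {pdb_id}.pdb")
```

With the file in hand, inspect the chains and surface residues (e.g. in the Mol* viewer on the RCSB page) to pick the hotspot residues that define your binding interface.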



Step 3: Inputting the structure and arguments into RFDiffusion
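The notebook cells for this step ultimately wrap a command along these lines, using the hydra-style options documented in the RFdiffusion repository; the paths, contig string, and hotspot residues below are placeholders, so treat this as a sketch rather than a copy-paste recipe.

```python
# Sketch: run RFDiffusion binder design against a target with hotspots.
import subprocess

subprocess.run([
    "python", "scripts/run_inference.py",
    "inference.input_pdb=target.pdb",           # target structure from Step 2
    "contigmap.contigs=[A1-150/0 70-100]",      # keep target chain A; diffuse a 70-100 aa binder
    "ppi.hotspot_res=[A30,A33,A34]",            # hotspot residues on the target
    "inference.output_prefix=outputs/binder",   # where designs are written
    "inference.num_designs=10",                 # number of backbones to generate
], check=True)
```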



Step 4: Choosing arguments to pass into ProteinMPNN and downloading results
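As a sketch of what this step runs under the hood (argument names follow the ProteinMPNN repository; paths and chain letters are placeholders):

```python
# Sketch: redesign the diffused binder chain with ProteinMPNN.
import subprocess

subprocess.run([
    "python", "protein_mpnn_run.py",
    "--pdb_path", "outputs/binder_0.pdb",    # RFDiffusion output complex
    "--pdb_path_chains", "B",                # redesign only the binder chain
    "--num_seq_per_target", "8",             # sequences sampled per backbone
    "--sampling_temp", "0.1",                # low temperature = conservative picks
    "--out_folder", "mpnn_out",              # .fa outputs land here
], check=True)
```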



Step 5: Formatting and filtering results
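One simple filter is to keep only sequences whose ProteinMPNN score clears a cutoff. The sketch below assumes the usual .fa output whose headers carry a "score=" field (lower is better); the file path and the 1.0 cutoff are placeholders to tune for your own runs.

```python
# Sketch: parse a ProteinMPNN .fa file and keep low-score sequences.
import re

keep = []
with open("mpnn_out/seqs/binder_0.fa") as fh:
    header = None
    for line in fh:
        line = line.strip()
        if line.startswith(">"):
            header = line
        elif header:
            m = re.search(r"\bscore=([\d.]+)", header)   # skips "global_score="
            score = float(m.group(1)) if m else float("inf")
            if score < 1.0:                  # keep confident designs only
                keep.append((score, line))

for score, seq in sorted(keep):              # best (lowest) scores first
    print(f"{score:.3f}  {seq}")
```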



Step 6: Evaluating valid results through GAABind


Following this step, there are myriad analyses that can be made. We encourage users to explore the organized output produced by GAABind and come up with their own interpretation of the results.

Introduction: In contrast to Colab, the Jupyter and sbatch workflows for this project are designed to be used on supercomputing clusters, and should not be run on local machines unless sufficient memory and computational power are available. These workflows are more industry-facing and can help companies gain traction in a rapidly growing and previously untapped modality of protein functional assays.


Warning: this code requires an A100 40 GB GPU at minimum and >30 GB of disk space.

Instructions: download the conda environments, clone the repository, and wget the zip files, as in the sketch below.
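A minimal version of that setup, run from Python for consistency with the rest of this page: the repository URL comes from the Code Availability section, while the environment file name and weights URL are placeholders for the actual paths in the repository.

```python
# Sketch: cluster-side setup (clone, create the conda env, fetch weights).
import subprocess

commands = [
    ["git", "clone", "https://github.com/patjiang/openBind.git"],
    ["conda", "env", "create", "-f", "openBind/environment.yml"],         # assumed env file
    ["wget", "-P", "weights/", "https://example.org/model_weights.zip"],  # placeholder URL
    ["unzip", "-o", "weights/model_weights.zip", "-d", "weights/"],
]
for cmd in commands:
    subprocess.run(cmd, check=True)   # stop immediately if any step fails
```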



Chapter 3: Computational Cost Comparisons and Proposed Experimental Validation


Time and Computational Cost Analysis

After assembling and cleaning the scripts used over the duration of this project, we decided to test them and analyze the pros and cons of running each approach. Because our goal is accessibility, we created two tasks: a minimal and a maximal one. This analysis does not include the binding prediction methods, due to the extreme computational cost of both approaches and the infeasibility of running full assays with them on most hardware.


For each of the two aforementioned tests, we settled on four platforms: Colab Free, Colab Pro, ASU's HTC cluster Jupyter instances, and ASU's HTC cluster SLURM instances. The scripts used for these tests are in the GitHub repository linked in the Code Availability section below.


To generate the figures, 40 separate runs were conducted on each of the four instances; the averages across runs are plotted below.


f7
Figure 7: Computing Unit Comparison of minimally sufficient task
f8
Figure 8: Computing Time Comparison of minimally sufficient task

f9
Figure 9: Computing Unit Comparison of maximal task
f10
Figure 10: Computing time Comparison of maximal task

Discussion:

Although running a complete workflow on T4 GPUs and free Google Colab may not always be practical, it is possible. Our tests demonstrate the feasibility of minimal-computation and relatively cheap approaches: one does not need HPC access or university funding to start tinkering with and thinking about de novo design projects. The purpose of this work is to democratize and share the code developed over the course of this project's cycle, and to allow others to build upon pre-existing work.


For a more exact estimate of cost, the CHE-to-dollar ratio according to (here) is $0.12/CHE. In Google Colab, 100 compute units cost $10, and running an A100 on the scripts we developed consumes roughly 10-15 units per hour, which gives a price of roughly $1.00-$1.50 per A100-hour. Purchasing 100 compute units and then using a T4 GPU with Colab Pro instead decreases consumption to only around 1.3-2 units per hour, bringing the price to roughly $0.13-$0.20 per hour, comparable to the $0.12/CHE cluster rate. We do not advocate the use of Google Colab Pro, but merely present it as a relatively accessible platform.
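The arithmetic behind those figures is simple enough to spell out; the burn rates below are the rough per-hour consumption we observed, and the unit price is Colab's public 100-units-for-$10 tier.

```python
# Sketch: dollars per GPU-hour from Colab compute-unit consumption.
price_per_unit = 10.0 / 100.0                           # $10 buys 100 compute units

burn_rates = {"A100": (10.0, 15.0), "T4": (1.3, 2.0)}   # units consumed per hour
for gpu, (lo, hi) in burn_rates.items():
    print(f"{gpu}: ${lo * price_per_unit:.2f}-${hi * price_per_unit:.2f} per hour")
# A100: $1.00-$1.50 per hour
# T4:   $0.13-$0.20 per hour  (vs. $0.12/CHE on the cluster)
```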


To gain some further insight into our results, we also assessed the differences between the two binding models. As the figure below shows, the supervised binding model (GAABind) tends to produce higher binding affinity predictions than the unsupervised approach (DynamicBind).


f13
Figure 11: Binding Model Assessment

In the figure above, color denotes binder length, while the red line marks where a perfectly linear relationship between the two models' predictions would fall.


The plot above suggests heterogeneity between our two binding models, which is an interesting result to elucidate from this project. We hope that, with proper experimental validation and greater clarity on how these binding metrics relate to in vitro results, the puzzle posed by this plot can be solved.


Proposed Experimental Validation

f11
Figure 12: ELISA System

Experimental validation is paramount to evaluating model capabilities. Specifically, for the task of developing novel therapeutics that bind relevant targets, binding assays are a necessity for salient and digestible information. To this end, we considered various ways of performing experimental validation throughout the cycle. At the start, we met with ASU's Dr. Neal Woodbury, who had previously developed protein language model based approaches to developing peptide therapeutics. While our initial expectations for binding assays were fraught with overconfidence, Dr. Woodbury helped put the scope and sensitivity of different assays into perspective. We discussed the possibility of using bio-layer interferometry (BLI) and fluorescence polarization (FP) binding assays. From his point of view, the relatively small size of our binder compared to our target would cause issues with fluorophore-conjugated methods such as NanoLuciferase assays.


After searching the ASU campus and realizing the lack of accessible BLI or FP hardware, we looked toward alternative binding methods that would capture the same flexibility as BLI and FP. Specifically, we looked into peptide SPOT arrays and reached out to JPT technologies for further advice and domain knowledge. After this meeting, we came to the agreement that SPOT arrays are too limiting for our application: we are interested in how secondary and tertiary structure inform binding, and SPOT arrays fix the peptides to a cellulose plate. JPT then proposed an ELISA-like system (Figure 12) for capturing the flexible interactions of our peptides. Although we wouldn't get a direct binding affinity measurement, we could hypothetically get a relative measure of binding affinity with proper control targets. The biggest benefit is that, if we could synthesize and extract our target protein from E. coli, we could potentially perform the entire assay locally with just a plate reader.


After much discussion and deliberation, we settled on this design: we would wash the plate with a fluorophore-conjugated fusion protein carrying the target. For the target protein, we designed an sfGFP fusion and planned to express it in E. coli. However, we ran into a roadblock: we worried that the tertiary structure of the protein would degenerate unless the hetero-oligomer of our target was also expressed. To remedy this, we looked into previous expression literature; specifically, we used this paper as a direct reference for the construct. Following their cloning methods, we constructed the plasmid map of a pCDFDuet expression system carrying our target protein.


Unfortunately, we were unable to complete this aspect of the project due to budget constraints on peptide synthesis. We hope to re-evaluate the peptides we want to validate and constrain our test set to a more financially feasible number of peptides.


Conclusion and Future Steps


The workflow we have developed for this cycle of iGEM is designed specifically for the task of de novo peptide design. To this end, we have curated an open-source platform, hosted largely through GitHub and Hugging Face for ease of public access. In particular, most of the scripts for generating sequences and structures are accessible and usable by the general populace. We utilize widely regarded, established models, and abstract user interaction away from the models themselves into more approachable scripts. Furthermore, we offer a full pipeline for the workflow, enabling non-experts and laymen to easily access and use our models. We believe that, by democratizing machine learning tools in protein design, as well as the scripts to create end-to-end platforms, we can increase the accessibility of these tools. In the future, we will continue to refine our approach and look into reducing some of the biases currently present within the pipeline (e.g., our reliance on AF2).


For potential future users, we stress the importance of fostering communication through the repository and with our team. We want the user experience to be as smooth as possible; if there are any questions, please ask. The emphasis of this project and its code is to create more excitement and dialogue between computational and wet-lab efforts.


Code Availability


All code can be found here: https://github.com/patjiang/openBind/tree/main


For the iGEM Project, the code is also hosted here: https://gitlab.igem.org/2024/software-tools/asu