I. INTRODUCTION
I.1. Modeling Objectives
This paper introduces Flint, a computational tool with one simple mission: generating mutants of protein receptors with enhanced binding specificity towards a target ligand. Built on PocketGen and integrated with AutoDock Vina for affinity evaluation, Flint allows precise optimization of receptor-ligand interactions by embedding the docking score directly into its transfer-learning process. Our results? Better binding specificity, more stable interactions, and a clear path forward to make Flint even smarter. Our project addresses a common problem in molecular biology: how can we design protein receptors that bind more specifically and tightly to a target molecule? These interactions are crucial in many biological processes, including drug development. Traditionally, improving these interactions in the lab involves a lot of trial and error, which can be slow and inefficient. Flint learns from each mutant's docking result, optimizing the receptor's structure through iterative changes. The goal is to generate receptors that bind significantly more strongly to the ligand, without having to rely on physical experiments at every step.
I.2. Input Data Structure
Flint requires two key input files to operate: the 3D structure of the protein receptor in PDB format and the 3D structure of the ligand in SDF format. The latter is widely used to describe small molecules, containing not only the 3D coordinates of the ligand's atoms but also additional information such as bond types and stereochemistry, while the PDB format describes the 3D coordinates of the protein's atoms, providing a detailed snapshot of its spatial configuration. It is a standard way for computing tools to access specific data about the receptor's geometry, active site, and flexibility for mutation generation. Both input files are then converted locally to the PDBQT format, which is specifically used in docking simulations such as those run by AutoDock Vina.
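As an illustration of this preprocessing step, a minimal sketch is shown below. It assumes Open Babel is installed and exposed as the `obabel` command; the exact conversion flags Flint uses internally are not detailed here, so this is an approximation rather than the actual pipeline.

```python
import subprocess

def to_pdbqt(input_path: str, output_path: str, rigid_receptor: bool = False) -> None:
    """Convert a PDB receptor or SDF ligand to PDBQT via the Open Babel CLI.

    Hypothetical helper: flag choices (e.g. -xr for a rigid receptor) may need
    adjustment depending on the Open Babel version and the docking setup.
    """
    cmd = ["obabel", input_path, "-O", output_path]
    if rigid_receptor:
        cmd.append("-xr")  # write the receptor without a torsion tree
    subprocess.run(cmd, check=True)

to_pdbqt("receptor.pdb", "receptor.pdbqt", rigid_receptor=True)
to_pdbqt("ligand.sdf", "ligand.pdbqt")
```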
II. FIRST APPROACH
II.1. State of the Art
The issue we're trying to address is a central challenge in computational biology, and for several years now, researchers have been using molecular dynamics simulations and structure-based algorithms to design new proteins. Since the AI revolution, machine learning and deep learning techniques have also been integrated into the process to perform predictive tasks that would have been intractable otherwise, even with cutting-edge brute-force algorithms. Tools like AlphaFold have revolutionized protein structure prediction, while generative and diffusion models have begun to explore the enormous space of possible protein mutations. However, the field still lacks a tool that effectively and simply combines a docking model with a mutation generator!
Several attempts have been made to bridge this gap. Rosetta, a widely used software suite for macromolecular modeling and design, includes modules for both protein docking and sequence design. While powerful, integrating these functionalities requires significant expertise and computational resources, making it less accessible for iterative mutation and docking cycles. Another tool, ProteinMPNN, was developed for rapid sequence design and generates protein sequences that are compatible with a given backbone structure. It excels in designing stable proteins but does not inherently account for ligand binding or optimize for improved docking scores.
Techniques like inverse folding methods have also been explored, which predict amino acid sequences from a given protein structure. While they can generate sequences that fold into a desired structure, they often do not consider ligand interactions during the design process. Generative Adversarial Networks (GANs) have been employed to generate novel protein structures and sequences; however, these models typically require extensive training data and may not focus specifically on enhancing ligand binding affinity. Additionally, DiffDock, a recent diffusion-based docking model, predicts protein-ligand binding poses with high accuracy. While DiffDock improves docking predictions, it does not generate receptor mutations to enhance binding affinity iteratively.
II.2. But what is a transformer?
Transformers are a class of deep learning models that have significantly advanced the field of artificial intelligence, particularly in natural language processing. Introduced by Vaswani et al. in 2017, in one of the best-known AI papers ever published ("Attention Is All You Need"), the core innovation of transformers lies in their ability to handle sequences of data by focusing on the relationships between all elements simultaneously, rather than processing them sequentially. This is achieved through a mechanism known as self-attention, which allows the model to weigh the importance of different parts of the input data relative to each other when making predictions. At the heart of the transformer architecture is the concept of encoding input data into continuous vector representations and then computing attention scores between these representations. Mathematically, for an input sequence of vectors, the model computes queries Q, keys K, and values V using learnable weight matrices WQ, WK, and WV. It then calculates a score that determines how much attention should be paid to each part of the input. This is done using the scaled dot-product attention formula below, where dk is the dimension of the key vectors. The softmax function ensures that the attention weights sum to one!
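For reference, the scaled dot-product attention formula introduced in that paper is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$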
Transformers are built by stacking multiple layers, each consisting of a self-attention mechanism followed by a position-wise feed-forward network. The self-attention allows the model to consider the entire sequence at once, capturing dependencies regardless of their distance in the sequence. The feed-forward network, typically implemented as two linear transformations with a non-linear activation in between, further processes the data at each position independently. Besides, an essential component of transformers is the multi-head paradigm, where the attention mechanism is replicated multiple times with different parameter sets (each replication is called a "head"). This allows the model to capture various types of relationships in the data. If there are h heads, each head computes its own attention as described above, and the results are concatenated and linearly transformed to produce the final output of the attention layer.
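Formally, following the same paper, the multi-head output is computed as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})$$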
In a Graph Transformer, the self-attention mechanism is modified to incorporate the graph's connectivity. The attention is computed not just over a sequence but over the nodes in the graph, taking their connections into account; it can be adjusted to only consider neighboring nodes, or to weigh nodes differently based on their distance or type of connection. For a graph G = (V, E), where V is the set of nodes and E is the set of edges, the attention mechanism can be modified as in the expression below. Here, Qi and Kj are the query and key vectors for nodes i and j respectively, e(i,j) represents the edge features between nodes i and j, and φ(e) is a function that maps these edge features to a scalar that influences the attention score.
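A common way to write this edge-aware attention score, reconstructed here from the definitions above (the exact expression used in the original figure may differ), is:

$$\alpha_{ij} = \mathrm{softmax}_{j \in \mathcal{N}(i)}\!\left(\frac{Q_i K_j^{\top} + \varphi(e_{ij})}{\sqrt{d_k}}\right)$$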
There are several types of layers in a Graph Transformer, and the key point is that by stacking many of them, the model can capture intricate patterns and relationships within the protein structure. Each layer allows the model to aggregate information from a broader neighborhood in the graph, enabling it to learn complex dependencies (a minimal code sketch follows this list).
- Node Embedding Layer: Transforms the raw features of each node (such as atom types, charges, etc.) into continuous vector representations suitable for processing by the model.
- Edge Embedding Layer: Similarly, this layer processes the features of the edges (such as bond types or distances) into vector representations.
- Graph Attention Layer: This is the core of the Graph Transformer, where the modified attention mechanism computes new representations for each node by attending to its neighbors, considering both node and edge features.
- Feed-Forward Network: After the attention mechanism, this network applies non-linear transformations to each node's representation, enhancing the model's expressive power.
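To make these layers concrete, here is a minimal, self-contained PyTorch sketch of a single edge-aware graph attention layer. It illustrates the general mechanism only, not PocketGen's actual implementation; the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeBiasedGraphAttention(nn.Module):
    """Single-head graph attention where edge features bias the attention score
    (illustrative sketch, not PocketGen's implementation)."""

    def __init__(self, node_dim: int, edge_dim: int):
        super().__init__()
        self.q = nn.Linear(node_dim, node_dim)   # W_Q
        self.k = nn.Linear(node_dim, node_dim)   # W_K
        self.v = nn.Linear(node_dim, node_dim)   # W_V
        self.edge_bias = nn.Linear(edge_dim, 1)  # phi(e): edge features -> scalar bias
        self.scale = node_dim ** 0.5

    def forward(self, h: torch.Tensor, e: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h:   (N, node_dim)       node features
        # e:   (N, N, edge_dim)    edge features
        # adj: (N, N) boolean mask (include self-loops so every node attends to itself)
        q, k, v = self.q(h), self.k(h), self.v(h)
        scores = q @ k.T / self.scale + self.edge_bias(e).squeeze(-1)
        scores = scores.masked_fill(~adj, float("-inf"))  # restrict attention to neighbours
        attn = F.softmax(scores, dim=-1)
        return attn @ v

# Toy usage: 5 nodes, 16-dim node features, 4-dim edge features, fully connected graph.
layer = EdgeBiasedGraphAttention(node_dim=16, edge_dim=4)
out = layer(torch.randn(5, 16), torch.randn(5, 5, 4), torch.ones(5, 5, dtype=torch.bool))
print(out.shape)  # torch.Size([5, 16])
```

In a full Graph Transformer, this attention block would be followed by the feed-forward network and residual/normalization steps described above, and the whole unit would be stacked several times.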
II.3. PocketGen Overview
PocketGen is a deep generative model designed for the full-atom generation of protein pockets that bind specifically to target ligand molecules. Unlike previous methods that focus solely on either protein sequence or structure generation, it co-designs both the amino acid sequences and the 3D structures of the protein pocket region. This integrated approach ensures sequence-structure consistency and effective ligand binding. PocketGen formulates the pocket generation task as a conditional generation problem: it aims to learn the probability P(B | A \ B, M), where B represents the set of pocket residues to be generated, A \ B denotes the protein scaffold (the rest of the protein excluding the pocket region), and M is the target ligand.
At the core of PocketGen is the equivariant bilevel graph transformer, a specialized neural network designed to capture multi-granularity interactions in protein-ligand complexes. The transformer operates on a geometric graph representation of the protein-ligand complex, where each residue or ligand is represented as a block (a set of atoms). The network consists of L = 6 layers, each comprising two main components. The first computes attention at both the atom level and the residue/ligand level (hence the name "bilevel" transformer); the second is an equivariant feed-forward network, which updates atom representations and coordinates while preserving equivariance to 3D Euclidean transformations (such as rotations and translations). The total number of trainable parameters in the transformer is approximately 7 million, accounting for the weights in the attention mechanisms and feed-forward networks across all layers. Moreover, layer normalization is applied at each layer to stabilize and accelerate training.
To ensure sequence-structure consistency and to incorporate evolutionary knowledge, PocketGen integrates a sequence refinement module that leverages pretrained protein language models (pLMs) from the ESM family; specifically, the ESM-2 model with 650 million parameters is used. To reduce computational overhead, only a lightweight structural adapter is fine-tuned during training. The structural adapter consists of a structure-sequence cross-attention mechanism, which aligns the structural representations from the transformer with the sequence representations from the pLM. For each residue, the structural representation is obtained by mean pooling over the atom features, while the sequence representation is obtained from ESM, where the pocket residues are initially masked.
III. DIVE INTO THE MODEL CORE
III.1. Feature space
What we call "features" is essentially the set of all the information that the model uses to understand the world and make predictions; it is often modeled as a high-dimensional tensor of decimal numbers. In our case, each atom is represented by a vector containing 38 different features: in machine learning terms, we would say that PocketGen's feature dimension is 38. These features capture various chemical and physical properties necessary for accurate modeling. First, the vector identifies the atom's element, such as carbon (C), nitrogen (N), oxygen (O), sulfur (S), or selenium (Se), and even detailed types like the alpha-carbon, all described using a one-hot encoding (embedding vector). Then, an in-depth description of what happens during the interaction is added: how the atom is bonded, its partial charge, hybridization state, hydrophobicity, aromaticity, etc. For example, ligand atoms receive an embedding representing the ligand molecule's identity and characteristics. To account for the position of each atom within its residue or ligand, a positional encoding is included that indexes the atom within its block.
The featurizer function φ mathematically maps each atom a(i,j) to its corresponding feature vector h(i,j), as shown below. This concatenation results in a high-dimensional vector, where d is the total dimensionality of the features, set to 128 in PocketGen. The featurizer ensures that all relevant atomic properties are encoded in a numerical format suitable for the neural network. For each residue or ligand block i, the atom feature vectors are aggregated into a matrix Hi, and the corresponding atom coordinates are compiled into a matrix Xi with n_i rows and 3 columns, thereby capturing the 3D spatial positions.
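Schematically, and only as a reconstruction based on the feature list above (not the exact expression from the original figure), the featurizer concatenates the different encodings into a single vector and stacks them per block:

$$h_{i,j} = \varphi(a_{i,j}) = \big[\,\mathrm{onehot}(\mathrm{element}_{i,j}) \,\|\, \mathrm{chem}_{i,j} \,\|\, \mathrm{pos}(j)\,\big] \in \mathbb{R}^{d}, \qquad H_i = [h_{i,1}; \ldots; h_{i,n_i}], \quad X_i \in \mathbb{R}^{n_i \times 3}$$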
To describe the interactions between atoms, PocketGen computes relative geometric features for pairs of connected blocks (i, j). In practice, pairwise Euclidean distances are computed and then embedded using radial basis functions (RBFs) to create a continuous and differentiable representation. Here, β is a scaling parameter, μ_k are the centers of the RBFs, and K is the number of kernels:
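A standard form for such an RBF embedding of a pairwise distance d, consistent with the parameters named above (the exact convention in the implementation may differ), is:

$$\mathrm{RBF}_k(d) = \exp\!\big(-\beta\,(d - \mu_k)^2\big), \qquad k = 1, \ldots, K$$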
III.2. Loss function
Loss functions play a key role in machine learning because they measure how far off a model's predictions are from the actual results. They provide a clear way to assess how well or poorly the model is doing. By calculating the error in predictions, the loss function gives a number that the model uses to improve itself.
The main purpose of a loss function is to guide the model during training. As the model learns, it adjusts its internal settings to try and minimize the loss, which leads to better predictions. Without loss functions, there would be no systematic way for the model to improve its accuracy. Here is the loss function for our model:
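Based on the terms described just below, the total loss has the following general form (a reconstruction of the missing expression, with the weights written as λ factors):

$$\mathcal{L} = \mathcal{L}_{\mathrm{seq}} + \lambda_{1}\,\mathcal{L}_{\mathrm{coord}} + \lambda_{2}\,\mathcal{L}_{\mathrm{struct}}$$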
Let's try to understand it by making each term explicit. Lseq is the sequence prediction loss, which encourages the model to predict the correct amino acid sequences for the pocket residues using a cross-entropy loss, where P is the true amino acid (one-hot encoded) and $\hat{P}$ is the predicted probability distribution over amino acids. Lcoord is the coordinate prediction loss, which ensures the predicted atomic positions are accurate by applying the Huber loss (defined below) to the predicted ($\hat{X}$) and true (X) coordinates. Lstruct is the structural regularization loss, added to maintain chemical plausibility by penalizing deviations of bond lengths and angles from standard values; it is computed with the same principle as Lcoord, applied to predicted bond angles and lengths. Finally, the λ factors act as weights that balance the contributions of each term.
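For completeness, the standard cross-entropy and Huber losses referenced above are (with δ the Huber threshold hyperparameter):

$$\mathcal{L}_{\mathrm{seq}} = -\sum_{c} P_c \log \hat{P}_c, \qquad \mathrm{Huber}_{\delta}(x, \hat{x}) = \begin{cases} \frac{1}{2}(x - \hat{x})^2 & \text{if } |x - \hat{x}| \le \delta \\ \delta\left(|x - \hat{x}| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$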
III.3. Gradient descent
As said earlier, we need to minimize the loss, but we cannot try every possible combination of weights. To find the one that gives us the lowest loss, we need another method to arrive at the minimum. Gradient descent is one of the most widely used! It is an iterative optimization technique that helps find the best values for the model's parameters by reducing the loss step by step.
In practice, gradient descent works by calculating the gradient (or slope) of the loss function at a given point, indicating how much the loss would change if the parameters were adjusted. The model then updates its parameters in the opposite direction of this gradient, taking small steps to reduce the loss, i.e. improve the model. By repeating this process over many iterations, the model moves closer to the optimal solution where the loss is minimized. The optimization does not need to find the global minimum; reaching a good local minimum is sufficient, and far more tractable. Mathematically, gradient descent updates the parameters θ, which stands for the full set of parameters.
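The update rule reads:

$$\theta \leftarrow \theta - \alpha\,\nabla_{\theta} f(\theta)$$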
Here, α is the learning rate, a small positive scalar determining the size of the steps taken during the update. It is one of the most important hyperparameters, and it does not need to be constant. The gradient $\nabla_{\theta} f(\theta)$ is a vector of partial derivatives of the function with respect to each parameter. The direction of the gradient points toward the steepest increase of the function, so moving in the opposite direction reduces the function's value. This process is repeated iteratively until convergence, meaning that the parameters θ reach a point where further updates produce negligible changes in the function's value.
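As a toy illustration, unrelated to PocketGen's actual training loop, a minimal gradient-descent loop on a simple quadratic function looks like this (NumPy sketch; the function, starting point, and learning rate are arbitrary choices):

```python
import numpy as np

def grad_f(theta: np.ndarray) -> np.ndarray:
    """Gradient of f(theta) = ||theta - 3||^2, whose minimum is at theta = [3, 3]."""
    return 2.0 * (theta - 3.0)

theta = np.array([0.0, 10.0])  # arbitrary starting point
alpha = 0.1                    # learning rate

for _ in range(100):           # iterate until (approximate) convergence
    theta = theta - alpha * grad_f(theta)

print(theta)  # converges towards [3. 3.]
```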
IV. CURRENT STATE
IV.1. Sample
ID | delta_G | Kd | mutations (AA) |
---|---|---|---|
original | -12.0660 | 7.19e+08 | 0 |
mutant_0 | -6.7186 | 1.12e+05 | 4 |
mutant_1 | -6.5560 | 3.17e+05 | 4 |
mutant_2 | -6.1796 | 3.46e+04 | 4 |
mutant_3 | -13.0210 | 4.25e+09 | 5 |
mutant_4 | -7.6204 | 5.16e+07 | 2 |
mutant_5 | -12.0160 | 3.70e+07 | 4 |
mutant_6 | -10.1612 | 1.28e+09 | 3 |
mutant_7 | -6.5194 | 7.16e+08 | 4 |
This summary of the XylS Wild Type protein run with the ligand PA shows promising results, with some mutants exhibiting comparable or even improved binding affinity to the original protein, such as mutant 3, which outperforms the wild type. However, the majority of mutants display poorer affinity, indicating that while the methodology is effective, there is still considerable room for improvement. Fine-tuning the model further, particularly in how mutations are selected and evaluated, could lead to significant advancements in generating mutants with consistently enhanced binding properties.
IV.2. Flaws
PocketGen, while effective at generating protein pocket mutants to improve ligand binding specificity, faces several critical limitations. It struggles to predict allosteric effects, which are crucial for understanding the nonlocal protein interactions that influence binding, and it tends to treat proteins too modularly, missing emergent behaviors that arise from whole-protein dynamics. It also has trouble maintaining sequence-structure consistency, especially in flexible pockets, because it generates mutations without refolding the structure from the new sequence. This can lead to structural failures: the protein's original fold may not hold, causing the binding pocket to shift or lose function, and without re-folding, mutations risk creating unrealistic or dysfunctional conformations that undermine prediction accuracy. The model is also biased towards well-characterized proteins, limiting its generalization to novel targets. Finally, its user interface is inflexible, hindering customization, and its reliance on docking algorithms like AutoDock Vina can result in reduced accuracy and high computational costs, especially with dynamic or large proteins.
Analyzing a broader set of 100 mutants with a similar distribution reveals a consistent pattern: while a small fraction of mutants show improved or comparable affinity to the original protein, the majority demonstrate reduced binding efficacy. Statistically, only about 5-10% of the mutants show equal or better binding, with a significant portion—nearly 70%—exhibiting noticeably weaker affinities. This suggests that, while the current approach can generate some successful mutants, the process is not yet reliable or efficient enough. The high variability and limited success rate indicate a need for substantial refinement to ensure more consistent improvements in protein-ligand binding across all mutants.
IV.3. Introducing Flint
Flint refines PocketGen by significantly improving allosteric effect prediction and providing a more flexible, user-friendly interface. It enhances the ability to model nonlocal protein interactions, allowing for more accurate predictions of how distant mutations influence ligand binding. The interface has been redesigned to offer greater customization, giving users more control over mutation parameters and docking settings. Flint has also been re-implemented with a cleaner codebase, making it easier to build and train custom versions. Additionally, it outputs results in a structured TSV format, including ranked mutants and binding scores, for more efficient analysis.
In addition, we are now training a custom version of the model, adjusting key parameters to align more closely with our specific goals. This tailored training will produce a checkpoint file of the updated model, which will be made publicly available for broader use. We have also introduced new metrics, such as a stability benchmark, to further refine mutant evaluation. Looking ahead, we plan to implement re-folding with AlphaFold to ensure that the generated PDB structures maintain realistic conformations, preventing nonsensical protein shapes.
REFERENCES
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA.
- Zhang Z, Shen WX, Liu Q, Zitnik M. (2024) Efficient generation of protein pockets with PocketGen. Nature Machine Intelligence 6, 1382–1395.