Introduction

The model is a means to understand the microscopic world of small molecules. Our team has constructed a model of a polypeptide consisting of 37 amino acids. We use this model to comprehend the physicochemical properties of Vg, to understand the microscopic process of Vg binding to its receptor, and to design protein mutations of Vg based on the theory of "Darwinian" evolution, aiming to achieve a polypeptide solution that surpasses nature.

Understanding Vg

Our Vg is derived from soybeans. As a wild-type polypeptide, it was found to have an excellent hypoglycemic effect in preliminary experiments. Taking this as our starting point, we analyzed Vg. Initially, our team employed an Artificial Neural Network (ANN) algorithm, which continuously learns from a large dataset of known phosphorylated protein sequences to predict phosphorylation sites.

Vg's structure and physicochemical properties

Ultimately, our team selected the NetPhos3.1 algorithm because it has extensively and systematically learned to capture the relationship between protein three-dimensional structure and function among the many algorithms using Artificial Neural Networks (ANNs). This aspect is often overlooked in traditional sequence-based prediction tools^[1]. By integrating this information, NetPhos 3.1 provides a more comprehensive perspective for the prediction of phosphorylation sites.

Figure 1. llustrates the architecture of the artificial neural network (ANN) algorithm, which simulates the cognitive and learning processes of the human brain, showcasing a large and systematic learning framework that captures the relationship between protein three-dimensional structure and function.

Our team processed the results exported from NetPhos 3.1 and identified phosphorylation sites at positions 18 and 36. These phosphorylation events can influence the intracellular localization of the protein. By altering the protein's hydrophilicity or its interaction with the cell membrane, phosphorylation can cause the protein to translocate from the cytoplasm to the cell membrane or the nucleus. This is crucial for cellular function and signal transduction^[2].

Figure 2. NetPhos3.1 exports predicted phosphorylation sites, with points marked as "yes" or exceeding the purple boundary indicating phosphorylation sites predicted with higher confidence by machine learning, resulting in predicted phosphorylation sites at positions 18 and 36.

Following this, our team employed the JNet algorithm, which utilizes the Position-Specific Iterated BLAST (PSI-BLAST) tool and Hidden Markov Model (HMM) profile configurations, coupled with a two-layer neural network for prediction, achieving a high level of accuracy. According to the latest Jpred4 algorithm, the accuracy rate for secondary structure prediction reaches 81.5%. We then used Jpred4 to predict the secondary structure of Vg,Our team observed a special secondary structure, a beta-sheet, present between amino acids 21 and 31.

Figure 3. The JPred4 algorithm exports predicted secondary structures of Vg, where the green regions indicate β-fold regions, with β-fold regions present at positions 21-31. The data transmission route is divided into three parts: yellow indicates input data, blue indicates abstract network activations, and green indicates output data.

Our team subsequently utilized the Conserved Domain Database (CDD) within the NCBI database, employing the BLAST (Basic Local Alignment Search Tool) and CD-Search tools to identify a highly reliable Albumin_I superfamily functional domain within Vg.

Figure 4. The conserved domain database (CDD) from the NCBI database was used, and through CD-search, a highly reliable albumin-I superfamily functional domain of Vg was identified. According to the annotations from the conserved domain database (CDD), the albumin-I superfamily functional domain can bind to the 43 kDα receptor protein VDAC-1.

According to the description from the Conserved Domain Database (CDD), the functional domain of our Vg's Albumin_I superfamily is associated with three beta-sheet regions. Connecting this to our previous observations, our team noted a special secondary structure, a beta-sheet, between amino acids 21 to 31, which is capable of binding to the 43kDa receptor VDAC-1.

Our team utilized the deep learning architecture and neural networks of AlphaFold3 to predict the three-dimensional structures of Vg and its receptor. This approach allows us to understand the potential interactions and binding modes between Vg and VDAC-1, which is crucial for elucidating the molecular mechanisms underlying their interaction.

Figure 5. Displays a simple architecture of AlphaFold3, divided into three modules: blue represents MSA & template, green represents PairFormer, and pink represents the diffusion model. The data transmission route is divided into three parts: yellow indicates input data, blue indicates abstract network activations, and green indicates output data.

To better showcase the structures, our team used PyMOL to create three-dimensional images.

Figure 6. Figure A clearly shows that Vg has a spherical structure, which provides a basis for its oral availability. From Figure B, it can be observed that there are disulfide bonds between the amino acids at positions 7 and 22, 3 and 15, and 20 and 32, which are crucial for the structural stability of Vg.

Evolutionary Tree Construction and Homology Alignment

Following this, our team conducted a search for Vg amino acids in the NCBI database. Here are thirty-three proteins similar to Vg. Our team used MEGAX to construct an evolutionary tree based on evolutionary relationships and performed homology alignment.

Figure 7. Utilized the align by clustalw function of megax software, where the yellow regions represent conserved domains and the white areas indicate mutation sites.

Figure 8. The data in this section is sourced from the blast in the NCBI database, employing the maximum likelihood algorithm of the Jones-Taylor-Thornton (JTT) model in megax software to construct a phylogenetic tree, which was beautified using cnsknowall (https://cnsknowall.com/). The red star marks Vg, the species in the green branches belong to Fabaceae, and those in the blue branches belong to Euphorbiaceae. No protein sequences with high overall similarity were found in animals and microorganisms.

We found that there are conserved regions (yellow areas) in different proteins across various species. Subsequently, our team conducted a positive selection site analysis on the functional domain of the Albumin_I superfamily formed by amino acids 21 to 31. We utilized the branch-site algorithm and likelihood ratio tests to analyze the accelerated rate of non-synonymous substitutions (dN) in specific genes as a measure of positive selection^[3]. This research on positive selection provides an important theoretical basis for protein engineering, which can assist in designing proteins with specific functions^[4]. In terms of algorithm selection, we chose PAML because, compared to other software such as HyPhy, PAML can process large datasets more quickly when analyzing selection pressure^[5] and also incorporates model-based inference. This approach makes the selection process more scientific and reduces the risk of selection bias^[6].

Figure 9. Shows the branch-site algorithm of PYMAL

Determination of the mutation scheme

Through the comprehensive analysis of the above methods, our team established the mutation sites and schemes. Thus, we integrated the different amino acid evolutionary strategies at positions 25, 27, and 28 from 33 similar peptides, resulting in a total of 26 mutant products, which we named（R25A，R25A_V27F，R25A_V27F_V28A, R25A_V27F_V28I, R25A_V27L, R25A_V27L_V28A, R25A_V27L-V28I, R25A_V28A, R25A_V28I, R25G, R25G_V27F, R25G_V27F_V28A, R25G_V27F_V28I, R25GV27L, R25G_V27L_V28A, R25G_V27L_V28I, R25G_V28A, R25G_V28I, V27F,V27F_V28A, V27F_V28I, V27L, V27L_V28A, V27L_V28I, V28A, V28I)

Figure 10. (A), (B), and (C) respectively display the mutation sites 2, 5, 2, 7, and 2, 8 of Vg. The sites 2, 5, 2, 7, and 2, 8 are represented in stick form, while the remaining parts are depicted in cartoon form.

Initial screening

Our team plans to first assess the binding potential of Vg's molecular ligands. By employing machine learning techniques and empirical scoring mechanisms, we aim to distinguish between active and inactive compounds, thereby eliminating some mutation schemes that have become inactive due to random mechanical combinations. We have utilized the chemScore algorithm, which decomposes the total energy of the protein-ligand complex into multiple energy terms, removes unreasonable geometric configurations and mutual positions, and finally obtains the activity of the molecular ligand through a scoring function. We chose the chemScore algorithm because of its fast calculation speed, allowing for extensive and rapid initial screening.

Subsequently, our team assessed the stability of the protein-ligand complex. We used the Hex scoring function of Discovery Studio software, which collects and calculates various energy terms of protein-ligand complexes after virtual docking, weights them through the scoring function, and combines them. We considered the Hex scoring function because it takes into account a variety of interactions in its algorithm, particularly in the calculation of hydrophobic effects, entropy effects, and electrostatic interactions, and has the advantage of requiring less computational power, serving as a second reference for our initial screening.

Finally, from the perspective of the speed at which the protein-ligand complex reaches binding energy equilibrium, we selected the Glide module within Glide Score. We used the OPLS all-atom molecular force field, taking advantage of the full-atom model mechanics analysis within the OPLS force field. Due to its maintenance of computational efficiency, some parameter ranges were simplified, making it the basis for the third initial screening. We also used Origin to plot a three-dimensional graph.

Figure 11. The x-axis represents the results derived from the chemscore algorithm in Maestro software, the y-axis represents the hex scoring function from Discovery Studio software, and the z-axis represents the glide score results from Maestro software. The binding site of VDAC-1 with Vg was again determined based on the binding positions obtained from the autodocking blind docking algorithm. In the figure, VDAC-1 is displayed in stick form, the binding sites are shown as white surfaces, and the binding pockets are represented as purple cubes. The red spheres indicate Vg, while the blue spheres represent the other 32 evolutionary strategies. In the data processing, since the chemscore results are within the range of (0-1), and both the hex scoring function and glide score represent binding free energy, we artificially took the negative values of all binding free energy data for easier visualization and observation of conclusions.

By analyzing the results derived from the three algorithms (the activity of Vg and the calculation of the free binding energy by the two preliminary scoring functions), it can be considered that the binding effect of Vg with 2jk4 is significantly superior to other evolutionary strategies. However, these algorithms have simplified and ignored some parameters and binding forces in order to save computational power. The next step could involve considering the use of more complex algorithms for more refined predictive calculations.

Fine screening

In the fine screening process, we first employed protein-ligand blind docking to simulate molecular behavior in a real biological environment. By analyzing the results of molecular docking, we identified the binding pocket of the receptor VDAC-1 for the next step of specific docking, obtaining an important parameter. In the protein-ligand blind docking, we chose the Vina force field combined with the Autodocking algorithm. Utilizing the Vina force field, we plotted the mathematical expression^[7] that describes the dependence of the system's energy on the particle coordinates, calculated the total energy of the molecular system to assess the stability of different conformations, and thereby obtained the most stable 3D conformation structure and system energy. Finally, using the global optimization algorithm of Autodocking, we were able to search over a wider range to find the optimal binding mode.

Figure 12. First, the binding pockets of VDAC-1 were predicted using the sitmap function in Maestro software, resulting in ten binding pockets. The binding site of VDAC-1 with Vg was then determined based on the binding positions obtained from the autodocking blind docking algorithm. In the figure, VDAC-1 is displayed in stick form, the binding sites are shown as white surfaces, and the binding pockets are represented as purple cubes.

Prior to this, all algorithms and scoring functions employed empirical functions, which, while conserving computational power, are less rigorous in terms of theoretical foundation. Therefore, our team utilized the more theoretically rigorous MMGBSA (Molecular Mechanics Generalized Born Surface Area) algorithm model. The principles are as follows:

⚫ Free Energy Decomposition:

First, the total binding free energy in the solvent is split into two parts: the molecular mechanics term (the binding free energy in vacuum) and the solvation energy, which are calculated separately. This avoids the situation where the energy contribution is primarily from the interactions between solutes, and the fluctuation range of the total energy is much greater than that of the binding energy. Such a scenario would require a very long time to converge. To address this, the following thermodynamic cycle is implemented. The calculation formula and the principle diagram are as follows:

Figure 13. Thermodynamic Cycle Schematic Diagram for the MMGBSA Algorithm Model

⚫ Molecular Mechanics Calculations:

Next, the energy of each molecule is calculated by considering the energies of bonds, bond angles, and dihedral angles, non-bonded van der Waals energies, and non-bonded electrostatic energies, which together constitute the energy absorbed during molecular docking.

The binding free energy in vacuum is calculated using the Second Law of Thermodynamics, with the conformational entropy determined through Normal Mode Analysis (NMA). However, in our model, we have ignored the changes in conformational entropy because the spatial changes of the ligand and receptor before and after docking are not significant, and Normal Mode Analysis (NMA) has its own deficiencies.

⚫ Solvation Energy

Solvation energy encompasses polar solvation energy and nonpolar solvation energy. The polar solvation energy is primarily calculated using the Generalized Born (GB) equation, which accounts for the electrostatic interactions between the solute and the polar solvent. On the other hand, nonpolar solvation energy is computed using the Solvent Accessible Surface Area (SASA), which quantifies the exposure of the solute to the nonpolar solvent.

We used the same MMGBSA calculation scheme and TIP4P water and solvent models, as well as the GBSA solvent model, with different molecular force fields, namely the AMBER force field and the OPLS-AA force field. The AMBER force field is a classic molecular force field known for its high accuracy, which is achieved through the integration of a large amount of experimental data and quantum computation results^[8]. The OPLS-AA force field has borrowed some torsional parameters from the AMBER force field and has been refitted based on the quantum chemical energies of dipeptides, resulting in parameters that are more suitable for our study of peptide-protein binding models. The free binding energy calculation results from the two force fields were named AMBER and Prime MMGBSA, respectively. By combining these results with the values from AutoDocking Vina, we composed the refined screening outcomes. Consistent with the initial screening, we also used Origin to plot three-dimensional graphs.

Figure 14. The x-axis represents the results derived from the vina force field combined with the autodocking algorithm, the y-axis represents the results from the mmgbsa calculation scheme using the opls-aa force field, and the z-axis represents the results from the mmgbsa calculation scheme using the amber force field. The red spheres indicate Vg, the orange spheres represent the top three mutation strategies in terms of binding effectiveness with VDAC-1 excluding Vg, and the blue spheres represent the other 29 mutation strategies. In the data processing for figures.11 andFigure.14 , we took the negative values of all binding free energy data to maintain consistency.

Conclusion

After two rounds of screening, preliminary and refined; two methods, empirical functions and MMGBSA; utilizing four molecular force fields, OPLS all-atom force field, Vina force field, AMBER force field, and OPLS-AA force field, we comprehensively evaluated the significant advantages of Vg binding to the evolutionary strategy receptor VDAC-1. The wild-type Vg can be directly used for further wet-lab synthesis exploration.

Outlook

During the six rounds of calculations, our team noticed that the mutation at position 28 to A has great potential. In the future, we can use synthetic biology methods to further study position 28 and explore more theoretical mechanisms of molecular binding. When selecting mutation schemes, our team used the local alignment algorithm of NCBI's Blast function. Subsequently, we can learn from the profile-HMM algorithm mentioned by the 2023 NUU-China team, which can help us find more proteins with different structures but similar functions. Through comprehensive analysis, we can uncover the essential mechanism of Vg and VDAC-1 molecular docking.

References

[1]Ismail, H.D., Jones, A., Kim, J.H., Newman, R.H., Kc, D.B. RF-Phos: A Novel General Phosphorylation Site Prediction Tool Based on Random Forest. BioMed Research International. 2016:2016:3281590. doi: 10.1155/2016/3281590. Epub 2016 Mar 15.

[2]Garcia-Garcia, T., et al. “Role of Protein Phosphorylation in the Regulation of Cell Cycle and DNA-Related Processes in Bacteria.” Frontiers in Microbiology 7 (2016): 184. doi: 10.3389/fmicb.2016.00184. Epub 2016 Feb 16. PMID: 26909079; PMCID: PMC4754617

[3]Yang Z. “Statistical properties of the branch-site test of positive selection.” Molecular Biology and Evolution 28, no. 3 (2011): 1217-28. doi: 10.1093/molbev/msq303. Epub 2010 Nov 18. PMID: 21087944; PMCID: PMC3030880

[4]Anderson, G., Hare, K.J., Jenkinson, E.J. “Positive selection of thymocytes: the long and winding road.” Immunology Today 20, no. 10 (1999): 463-468. Epub [Epub date not available in the sources]. PMID: 10500294

[5]Ramasamy, S., et al. “paPAML: An Improved Computational Tool to Explore Selection Pressure on Protein-Coding Sequences.” Genes 13, no. 6 (2022): 1090. doi: 10.3390/genes13061090. Epub 2022 Jun 18. PMID: 35741852;

[6]Álvarez-Carretero, S., Kapli, P., & Yang, Z. “Beginner’s Guide on the Use of PAML to Detect Positive Selection.” Molecular Biology and Evolution 40, no. 4 (2023): msad041. doi: 10.1093/molbev/msad041. Epub 2023 Apr 4.

[7]Wang, H., et al. “Applications and Advances in Machine Learning Force Fields.” Journal of Chemical Information and Modeling 63, no. 22 (2023): 6972-6985. doi: 10.1021/acs.jcim.3c00889. Epub 2023 Sep 26.

[8]Assessing the Current State of Amber Force Field Modifications for DNA. Journal of Chemical Theory and Computation. 2023; 19(13): 4299-4307. doi: 10.1021/acs.jctc.3c00233. Epub 2023 Jun 21. PMCID: PMC10339676.

SUGGESTION