Background

The unnatural amino acid modification of the CsgA protein in Escherichia coli is a key innovation in our team's project. However, we lack a detailed understanding of the physicochemical properties of the modified CsgA protein (referred to as CsgA-TET), which hinders our grasp of molecular behavior in the innovative click-to-release drug delivery system. Therefore, at this stage of the project, further research is needed on CsgA-TET's behavior in aqueous solution, including its aggregation, stability, and solubility. Measuring physical properties such as protein solubility, structural compactness, and surface area is crucial for studying the free energy changes associated with protein stability or aggregation.

For CsgA-TET with different unnatural amino acid insertion sites, using conventional wet lab methods would require extensive repetitive work from gene modification to electron microscopy scanning and DSC evaluation. This would be highly inefficient for our project development. Moreover, studying the mechanism of CsgA-TET aggregation solely through macroscopic measurements from wet lab experiments makes it difficult to provide a vivid description. Therefore, we decided to employ computer simulation methods, using molecular dynamics calculations to mimic protein molecular behavior and structural properties in aqueous solutions. This approach is efficient, time-saving, and can provide a wealth of information to help the team avoid massive workloads. It also offers visualized mechanism demonstrations, allowing for a more vivid, detailed explanation of the team's work to the public.

Methods

Basic Software

GROMACS: In the fields of computational chemistry and biology, numerous outstanding scientists and research groups have proposed their theories and methods, gradually developing into a scientifically rigorous system. Many interactive software packages have been created by developers to facilitate experiments for researchers worldwide. In molecular dynamics, there are several well-recognized, long-standing software packages with robust systems, such as AMBER developed by David Case's team at the University of California, San Francisco; NAMD developed by the Theoretical and Computational Biophysics Group led by Klaus Schulten at the University of Illinois Urbana-Champaign; and CHARMM developed by Professor Martin Karplus and his research team at Harvard University. Each has its own research advantages. Among these, GROMACS, originally developed by Herman Berendsen's team at the University of Groningen in the Netherlands, boasts extremely powerful functionality and a large community of developers and researchers. It is also one of the few open-source molecular dynamics software packages, which led to its selection as the primary software for the molecular dynamics simulation component of our project.

Gaussian: Unlike molecular dynamics simulations based on Newtonian classical mechanics, Gaussian adopts quantum chemistry fundamentals, simulating molecules by solving wave functions. In our modified CsgA-TET system, Gaussian is used to optimize the structure of unnatural amino acids and calculate important physical parameters such as charge numbers, facilitating the construction of CsgA-TET and further simulation of its behavior in aqueous solutions.

Force Fields

Several well-maintained force fields are suitable for simulating protein behavior in aqueous solutions, each developed and designed by the software's responsible developers. For example, the AMBER series of force fields, from ff94 to ff19SB, is primarily used for protein and nucleic acid simulations, continuously optimizing and improving charge distribution and dihedral parameters for high accuracy. The CHARMM series of force fields is widely applied to biomolecular systems, ranging from CHARMM19 to CHARMM36m, and extends to lipids, carbohydrates, and other biomolecules. In addition to these all-atom molecular force fields, coarse-grained force fields based on approximate, simplified protein models, such as MARTINI and UNRES, reduce computational precision but enable complex, long-term simulations.

In our study, the simulation of CsgA-TET's physical properties in aqueous solution will use the higher-precision AMBER99sb-ILDN, while the protein aggregation process will be simulated using the coarse-grained UNRES. This design ensures maximum computational efficiency at different time and molecular scale levels, making the most rational use of computational resources.

Results and Analysis

Study of CsgA-TET Physical Properties in Aqueous Solution

In the early stages of the experiment, due to the numerous possible unnatural amino acid substitution sites, molecular dynamics was used to predict the optimal substitution sites. By constructing the structural files of CsgA-TET and using the AMBER99sb-ILDN force field for GROMACS-based molecular dynamics calculations to simulate its behavior under physiological conditions, we can obtain high-precision molecular behavior trajectories, providing data for further analysis of protein physicochemical properties.

A single CsgA protein consists of 151 residues and has a width of approximately 31Å. It contains five hydrophobic repeat sequences that form a β-barrel structure primarily composed of β-sheets (Figure 1). The crystal structure and PDB file of CsgA were resolved and published by Fan Bu et al1. using cryo-electron microscopy, providing the initial files for CsgA-TET molecular dynamics simulations.

Figure_1
Figure 1. The crystal structure of CsgA protein (PDB ID: 6V5B)

First, we manually optimized the structure of the manually constructed tetracyano unnatural amino acid (TET) using Gaussian-based quantum chemistry simulations. The calculations employed the DFT method, using the classic B3LYP-D3(BJ)/def2-SVP basis set, which is undoubtedly more than adequate for small molecule structure optimization. During the optimization process, single-point energy calculations were performed simultaneously to generate molecular surface electrostatic potential data. Multiwf2 was used to calculate the RESP2 charges3 on the surface of the tetrazine amino acid to create a high-precision tetrazine unnatural amino acid insertion file. Additionally, the TET molecule PDB file data was used to modify the rtp file of the AMBER99SB-ILDN force field, enabling the classical force field-based protein simulation to be compatible with unnatural amino acids.

Next, we manually replaced TET at different sites in the CsgA protein to construct structural files for CsgA-TET with different substitution sites (Figure 2). The promising positions 97, 89, 50, and 48 were proposed by wet lab experiments and manually mutated, creating four mutants: F97T, S89T, Y50T, and Y48T (Figure 3). The mutant proteins underwent preliminary energy minimization using a combination of steep descent and conjugate gradient methods. A 0.15 mol/L NaCl aqueous solution was used to simulate the behavior of the mutant proteins under physiological conditions (Figure 4), with a temperature of 300K and a simulation time of 5ns.

tet
Figure 2. Structure of TET
Since the unnatural amino acids were not included in the visualization software script (vmd), the non-natural amino acids of the mutant proteins did not have the double bond pattern of the benzene ring and tetrazine.
f97t s89t y50t y48t
Figure 3. The structure of CsgA-TET with different unnatural amino acid insertion sites
sasa
Figure 4. Protein simulation trajectory

The conformations from the last frame of all MD simulations were extracted for comparison. Furthermore, given that CsgA protein monomers interact through hydrophobic forces due to their unique β-sheet structure, exploring the water solubility or hydrophobic capability of different mutation sites became essential. Figure 5 shows a comparison of the hydrophilic and hydrophobic solvent-accessible surface areas (SASA) for the four mutants. The hydrophilic SASA was roughly defined, based on empirical data, as residues on the protein surface with charge outside the range [-0.2, 0.2], calculated using the default probe radius (0.14nm). The hydrophobic SASA was calculated for residues with charges within the range [-0.2, 0.2]. This approximation allowed us to qualitatively investigate the hydrophobic and hydrophilic capabilities of proteins with different mutation sites.

sasa
Figure 5. Solvent-accessible surface areas

Through ANOVA analysis, we found that there were intergroup differences in the solvent-accessible surface areas among the four mutation sites, with F97T showing superior hydrophobicity compared to other mutation sites. The potential comparison between the mutants is shown in Figure 6, which showcases the stability of F97T outperforms in the four mutants. This information was reported to the wet lab team to guide further research.

Figure 6. Mean of Total Potencial

CsgA Dimerization

The dimerization of CsgA monomer proteins is a crucial mechanism in the E. coli flagellar secretion system, and our Curmino system primarily focuses on this mechanism. Therefore, exploring the fundamental mechanisms provides more detailed insights into bacterial secretion behavior and future drug design, becoming another focal point of molecular dynamics. Thermodynamic analysis of protein nucleation behavior primarily involves free energy analysis, which potentially provides numerous data on protein properties, such as binding rates and stability, especially for the stable β-barrel CsgA protein with minimal structural changes. We simplified the aggregation process of CsgA protein into a model of two monomer proteins dimerizing, attempting a simple molecular dynamics simulation by manually separating the dimer protein to a certain distance.

However, during the exploration process, we encountered local minima (Figure 5). Due to hydrophobic interactions, the proteins did not connect end-to-end as we had imagined, but rather joined tangentially due to the phenyl rings and alanine structures of hydrophobic residues on the sides. This result differed significantly from that obtained through electron microscopy analysis (Figure 6, PDB ID: 8enq).

Figure 7. Error Aggregation
Figure 8. Standard Conf

To overcome this phenomenon, we engaged in extensive discussions and investigations, receiving advice from senior laboratory members and peers from other institutions. Previous researchers have accumulated considerable experience in addressing insufficient sampling due to local energy minima in amyloid protein dimerization, overcoming local energy wells, and finding the thermodynamically optimal dimerization process. An obvious method is long-timescale calculations until the system reaches thermodynamic equilibrium. The timescale for side-chain conformational changes of proteins in water is in picoseconds, secondary structure transitions (such as β-sheet and α-helix formation) occur on the nanosecond scale, and protein folding rates are in the microsecond to millisecond range. These are all well-suited for molecular dynamics calculations in picosecond units. However, macroscopic protein aggregation behavior occurs on millisecond scales and above, and the order of magnitude increase in computational cost is prohibitively expensive.

Nevertheless, some tools can significantly reduce the cost of molecular dynamics simulations. Discrete GB implicit solvent models4, coarse-grained force fields like MARTINI5, and event-driven discrete molecular dynamics (DMD) simulations6 all provide approaches to reduce computational resources for long-timescale protein behavior simulations. In 2021, Feng Ding et al.7 used DMD to improve computational efficiency in studying the α-β structural transition of four monomers of the key Alzheimer's amyloid Tau protein R1-4, obtaining long-timescale conformational change behavior of Tau protein.

Beyond improving computational efficiency, a more powerful method—enhanced sampling8—is widely applied by researchers. The essential purpose of enhanced sampling is to overcome sampling errors caused by local sampling by expanding the sampling surface. Classic methods such as replica exchange molecular dynamics (REMD), umbrella sampling, and metadynamics are well-known. Their principles mostly involve adding global or local biases to the system to sample regions that are difficult to access through ordinary sampling. In an excellent 2017 study on the CRISPR-Cas9 protein system, J. Andrew McCammon et al.9 employed the collective variable (CV)-independent Gaussian accelerated molecular dynamics (GaMD) method, gradually overcoming local energy wells by adding Gaussian potentials globally. These methods can also be combined for better results. For example, a review on protein aggregation in crowded environments mentioned the combined use of implicit solvent models, coarse-grained force fields, and REMD, providing a comprehensive and detailed calculation of protein peptide solvent behavior in the presence of crowders through this three-level multidimensional model.

In our research, the REMD+REUS (Replica Exchange Umbrella Sampling) tool became our basic method for analyzing the binding free energy of CsgA protein monomers. REMD is a sampling technique independent of collective variables, requiring no definition of special reaction coordinates, but merely enhancing sampling efficiency globally through temperature scaling. Its basic sampling strategy is shown in Figure 8, where for multiple systems simulated in parallel at different temperatures, attempts are made to exchange conformations between adjacent temperature systems at regular intervals. This exchange is based on the Metropolis criterion10:

Where ξ is the reaction coordinate, q is the system coordinate, H is the Hamiltonian, and β is 1/kBT. Under this attempt, the system accepts lower energy conformations while also providing some exchange opportunities for higher energy conformations, enhancing the system's exploration of space and allowing it to escape low-energy wells.

REUS is an extension of US, with basic principles similar to the reaction coordinate-dependent US. After defining the reaction coordinate, a quadratic bias potential is added along the reaction coordinate:

Where k is a user-defined force constant, and ξ0 is the manually selected target reaction coordinate. After definition, the system will tend to explore the space of the target reaction coordinate, enhancing local sampling efficiency.

Figure 9. REMD sampling
Figure 10. US sampling

The data obtained from the above enhanced sampling methods are biased samples obtained by artificially increasing sampling efficiency. To make the sampling data unbiased, the Weighted Histogram Analysis Method (WHAM) is widely used. Its principle lies in calculating probabilities in the form of histograms for the sampled data and using the Boltzmann factor for unbiased correction:

Where kB is the Boltzmann constant, Pu is the unbiased probability distribution of the system, Gk is the system free energy, nkt is the sampling amount of a single window, nt and is the number of samples in a single bin of the histogram. Through Boltzmann correction of the sampling windows, iteratively solving the above function to convergence yields the system's free energy and probability distribution.

Building upon WHAM, the Bennett Acceptance Ratio (BAR)11 was proposed, utilizing the Boltzmann distribution of the canonical ensemble. By calculating the weighted partition function, it derives the relative free energy between two states:

To address the limitation of BAR in describing only two states, the Multistate Bennett Acceptance Ratio (MBAR)12 was further developed:

In this equation, Ni and Nj denote the number of samples in individual windows, fi represents the total free energy of the system in state i, and K is the number of states. This method provided crucial guidance for our post-analysis of two-dimensional multi-window, multi-replica data.

For our investigation of CsgA protein dimerization, we employed the AMBER99SB-ILDN force field (in preparation for future studies on unnatural amino acid mutant proteins) and conducted simulations in a 0.15 M sodium chloride solution. We defined the center of mass distance as our reaction coordinate, a common approach in such studies. Based on the results of center of mass pulling(Figure 11), we selected 8 biased coordinates. For each coordinate, we constructed replicas at two temperatures, 298.15 K and 398.15 K, for Replica Exchange Molecular Dynamics (REMD). Concurrently, we performed a 1 ns simulation comprising 250,000 steps, attempting conformational exchange every 1,000 steps. The resulting data underwent further MBAR analysis to obtain a two-dimensional (8*2) free energy landscape (FES) of CsgA protein dimerization.

We simplified the STRING model for protein binding free energy analysis proposed by Benoît Roux et al.13, employing the protein center of mass distance and the Θ angle as our reaction coordinates. The incorporation of the Θ angle was motivated by pre-simulation observations: as CsgA protein monomers separate, the Θ angle decreases, indicating that protein separation is driven by hydrophobic interactions causing the angular regions of the two proteins to attract and approach each other. We designated the center of mass of residues 49-53 of the CsgA1 protein as the third reference point, as illustrated in Figure 12.

Our research utilized Gromacs with the PLUMED plugin14, facilitating the definition of desired reaction coordinates and enabling multi-window, multi-replica simulations. We conducted a total of 16 ns of replica exchange simulation. The generated potential energy and reaction coordinates were analyzed using pymbar (Shirts & Chodera, 2008; Shirts et al., 2023) for MBAR analysis, producing a two-dimensional FES graph (Figure 13). This graph clearly depicts the relationship between free energy changes and variations in Θ value and center of mass distance. As the center of mass distance increases, Θ explores the entire conformational space, aligning with our expectations. However, Θ also exhibits a decreasing trend, indicating that the angular regions of the two CsgA proteins are approaching each other (Figure 12). Symmetry considerations suggest that there should also be a trend of increasing Θ angle, causing the angular regions on the opposite sides of the two proteins to converge. This discrepancy arises from our choice of initial conformation; we were unable to achieve complete conformational space sampling. Potentially, increasing the bias potential of Θ could address this issue.

Figure 14 illustrates the free energy changes across different sampling states, with the free energy of the distant window at a center of mass distance of 2.835 nm set as the reference point (zero). Notably, window 5, corresponding to a center of mass distance of 2.523 nm, exhibits the minimum relative free energy for the system. This represents an appropriate center of mass distance. Using states 5 and 16 for a rough estimation of binding free energy, we obtained ΔG = -3.9031 (± 0.0788) kT, approximately -2.31 kcal/mol. This value is relatively significant and demonstrates the strong aggregation propensity of CsgA amyloid protein.

Figure 11. Results of center of mass pulling
Figure 12. Model structure
Figure 13. Two-dimensional FES graph
Figure 14. Free energy changes across different sampling states
Conclusion and Future Prospects

This study conducted detailed simulations on the thermodynamic parameters of unnatural amino acids and nucleation reactions in CsgA protein, yielding comprehensive and detailed data. The highlights of this research include:

  • Expansion of the unnatural amino acid library
  • Realization of the potential for TET4.0 residue protein point mutations

The next phase of research will focus on employing enhanced sampling methods to study the aggregation thermodynamic parameters of CsgA proteins with unnatural amino acid tetrazine point mutations. This endeavor aims to provide thorough and detailed foundational research for the development of the CURMINO platform, while simultaneously offering robust data to support further drug testing and simulation efforts.

This comprehensive approach not only advances our understanding of CsgA protein behavior but also lays a solid groundwork for future developments in protein engineering and drug design. The integration of unnatural amino acids and advanced computational techniques opens new avenues for exploring protein dynamics and interactions, potentially leading to innovative therapeutic strategies.

References

1. Bu, F.; Dee, D. R.; Liu, B., Structural insight into Escherichia coli CsgA amyloid fibril assembly. 2024, 15 (4), e00419-24

2. Lu, T.; Chen, F., Multiwfn: A multifunctional wavefunction analyzer. 2012, 33 (5), 580-592.

3. Schauperl M, Nerenberg P, Jang H, Wang L-P, Bayly CI, Mobley D, et al. Force Field Partial Charges with Restrained Electrostatic Potential 2 (RESP2). ChemRxiv. 2019

4. Still, W. C.; Tempczyk, A.; Hawley, R. C.; Hendrickson, T., Semianalytical treatment of solvation for molecular mechanics and dynamics. Journal of the American Chemical Society 1990, 112 (16), 6127-6129.

5. Marrink, S. J.; de Vries, A. H.; Mark, A. E., Coarse Grained Model for Semiquantitative Lipid Simulations. The Journal of Physical Chemistry B 2004, 108 (2), 750-760.

6. Shirvanyants, D.; Ding, F.; Tsao, D.; Ramachandran, S.; Dokholyan, N. V., Discrete Molecular Dynamics: An Efficient And Versatile Simulation Method For Fine Protein Characterization. The Journal of Physical Chemistry B 2012, 116 (29), 8375-8382.

7. He, H.; Liu, Y.; Sun, Y.; Ding, F., Misfolding and Self-Assembly Dynamics of Microtubule-Binding Repeats of the Alzheimer-Related Protein Tau. J Chem Inf Model 2021, 61 (6), 2916-2925.

8. Liao, Q., Enhanced sampling and free energy calculations for protein simulations. Prog Mol Biol Transl Sci 2020, 170, 177-213.

9. Palermo, G.; Miao, Y.; Walker, R. C.; Jinek, M.; McCammon, J. A., CRISPR-Cas9 conformational activation as elucidated from enhanced molecular simulations. Proc Natl Acad Sci U S A 2017, 114 (28), 7260-7265.

10. Qi, R.; Wei, G.; Ma, B.; Nussinov, R., Replica Exchange Molecular Dynamics: A Practical Application Protocol with Solutions to Common Problems and a Peptide Aggregation and Self-Assembly Example. Methods in molecular biology (Clifton, N.J.) 2018, 1777, 101-119.

11. Bennett, C. H., Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics 1976, 22 (2), 245-268.

12. Shirts, M. R.; Chodera, J. D., Statistically optimal analysis of samples from multiple equilibrium states. J Chem Phys 2008, 129 (12), 124105.

13. Suh, D.; Jo, S.; Jiang, W.; Chipot, C.; Roux, B., String Method for Protein-Protein Binding Free-Energy Calculations. J Chem Theory Comput 2019, 15 (11), 5829-5844.

14. Tribello, G. A.; Bonomi, M.; Branduardi, D.; Camilloni, C.; Bussi, G., PLUMED 2: New feathers for an old bird. Computer Physics Communications 2014, 185 (2), 604-613.