Part One On-target and Off-target Mining and gRNA Design

The HBsAg gene sequence is present in every form of hepatitis B virus (HBV) DNA: rcDNA, cccDNA, and viral DNA integrated into the host genome. After literature research and internal discussion, our team concluded that eliminating the HBsAg gene is the most effective way to remove the various forms of HBV DNA and to reduce hepatitis B detection indicators. We therefore decided to place the MAD7 gene editing recognition site in the HBsAg gene.

We first retrieved HBsAg gene sequences from NCBI and compared the sequences of different HBV subtypes to obtain a conserved HBsAg sequence. We then searched this conserved sequence for the MAD7 PAM (5'-YTTV-3') and identified 33 potential spacer design sites. For each PAM site, we marked the first 20 bases at the 5' end of the PAM on the complementary strand as a candidate on-target sequence.
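The PAM search described above can be sketched in a few lines of Python. This is a minimal illustration, not the script we used: the function names and the toy sequence are hypothetical, and it assumes the common Cas12a convention that the protospacer is the 20 bases immediately 3' of the PAM on the strand that carries the PAM (equivalently, the target lies 5' of the PAM on the complementary strand).

```python
import re

# IUPAC codes needed for the MAD7 PAM 5'-YTTV-3' (Y = C/T, V = A/C/G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "[CT]", "V": "[ACG]"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_candidates(seq, pam="YTTV", spacer_len=20):
    """List (strand, pam_start, pam, protospacer) for every PAM hit.

    Assumption: the protospacer is the spacer_len bases immediately 3' of
    the PAM on the strand that carries the PAM. For "-" hits, pam_start is
    given in coordinates of the reverse-complemented sequence.
    """
    # A lookahead lets overlapping PAM occurrences all be reported.
    pattern = re.compile("(?=(" + "".join(IUPAC[b] for b in pam) + "))")
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for m in pattern.finditer(s):
            start = m.start() + len(pam)
            protospacer = s[start:start + spacer_len]
            if len(protospacer) == spacer_len:  # skip hits too close to the end
                hits.append((strand, m.start(), m.group(1), protospacer))
    return hits


# Toy sequence for illustration only, not the HBsAg conserved sequence.
toy = "TTTA" + "ACGTACGTACGTACGTACGT" + "GG"
for hit in find_candidates(toy):
    print(hit)
```

Scanning both strands matters because a PAM on either strand defines a usable site; the real input would be the conserved HBsAg sequence rather than the toy string.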

To make in vivo gene editing therapy as safe as possible, the on-target sequence should resemble as few human genes as possible. Cas-OFFinder screens the whole genome for sequences that have the same PAM and length as a given sgRNA. By counting the mismatches, insertions, and deletions between each genomic sequence and the query, it selects the potential off-target sites that meet the chosen criteria.

We searched in Cas-OFFinder for human genes similar to the 33 candidate on-target sequences, and tabulated the number of human genes with fewer than four base differences. By comparing the number of similar human genes for each candidate on-target, we selected candidate No. 24, which had the fewest similar human genes, as our on-target.
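The selection step can be illustrated with a short sketch. It assumes the search results have already been reduced to (candidate number, mismatch count) pairs, one per genomic match; that layout, the function name, and the toy numbers are hypothetical and are not Cas-OFFinder's native output format.

```python
def pick_on_target(hits, candidates, max_mismatches=3):
    """Count near-matches per candidate and return the least similar one.

    hits: iterable of (candidate_id, n_mismatches), one per genomic match.
    Only matches with 1..max_mismatches mismatches are counted.
    """
    counts = {c: 0 for c in candidates}
    for candidate, n_mismatches in hits:
        if 1 <= n_mismatches <= max_mismatches:
            counts[candidate] += 1
    # On ties, min() keeps the candidate that appears first in `candidates`.
    best = min(candidates, key=lambda c: counts[c])
    return best, counts


# Toy data for illustration only.
toy_hits = [(1, 2), (1, 3), (2, 1), (24, 4), (2, 3), (1, 1)]
best, counts = pick_on_target(toy_hits, candidates=[1, 2, 24])
print(best, counts)
```

In practice one might also weight matches by mismatch count, since a site with one mismatch is a greater concern than one with three; the sketch counts them equally, as the tabulation above does.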

Table 1 Number of human genes with 1-3 mismatches to selected on-targets

Based on the selected on-target sequence, we designed a 20 bp spacer. Joining the crRNA to the spacer sequence gave the MAD7 gRNA sequence.

Because steric hindrance from amino acids is weak there, the PAM-distal end of the DNA target binds the gRNA with a higher tolerance for mismatches. Following this off-target rule, we introduced two A mutations at the PAM-distal end of the on-target sequence. The resulting sequence served as a synthetically designed off-target for verifying the MAD7 mutants.
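As a small illustration of this design step, the sketch below builds a synthetic off-target by replacing PAM-distal bases with A. It is only a sketch under stated assumptions: the function names are hypothetical, the sequence is written 5' to 3' starting right after the PAM (so the PAM-distal end is the last positions), and the choice of exactly the final two positions is an assumption rather than the positions we used.

```python
def make_synthetic_off_target(on_target, n_mut=2, new_base="A"):
    """Replace the last n_mut (PAM-distal) bases of on_target with new_base."""
    tail = on_target[-n_mut:]
    if new_base in tail:
        # Substituting A for A would not create a mismatch at that position.
        raise ValueError("PAM-distal tail already contains " + new_base)
    return on_target[:-n_mut] + new_base * n_mut


def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Toy 20-nt sequence for illustration only.
on_target = "ACGTACGTACGTACGTACGT"
off_target = make_synthetic_off_target(on_target)
print(off_target, count_mismatches(on_target, off_target))
```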

Part Two Analysis and Prediction of High Precision MAD7 Mutation Sites

Abstract

This study explores how Cas12a mutation sites affect the specificity of gRNA and DNA binding, with the aim of reducing the off-target rate of the MAD7 protein during gene editing. From our wet lab data, we observed that mutations in residues located 20 to 32 Å from the DNA are the most likely to decrease the off-target rate. Using the specific mutation K739T, we examined how changes in amino acid properties affect the off-target rate of the MAD7 enzyme, and we cross-validated our findings with zero-shot mutation effect predictions run on the SaprotHub platform. From these data, we identified a set of beneficial sites that reduce the off-target rate. These potential mutation regions offer theoretical and experimental guidance for optimizing future gene editing tools.

2.1 Structural Analysis of MAD7 Low Off-target Mutation Sites

We first performed a statistical analysis of the twelve MAD7 mutant variants tested in wet lab experiments (Fig.1). We calculated the minimum distance from each mutation site to both DNA strands and plotted the relationship between these distances and the off-target rates (Fig.2). Two translucent fan-shaped frames with radii of 20 and 32 Å mark the distance levels to DNA. Mutation sites closest to DNA exhibited higher off-target rates, while those slightly further away clustered together and significantly affected the off-target rate. The most distant sites were sparse and showed almost no extreme off-target rates. The data suggest that amino acids in the first layer of contact with DNA may cause local structural changes in the DNA that interfere with target specificity. In contrast, sites in the second layer of contact can maintain the structural integrity of their local environment, balancing specificity and adaptability and enhancing targeting effectiveness. Mutation sites located far from the DNA have a smaller impact on the protein's interaction with DNA/RNA, so the influence on the off-target rate comes mainly from amino acids that are spatially closer to the DNA.

Fig.1 Off-target rates of different mutants measured in the experiment.

Fig.2 Point Cloud of Different Mutant Series. Distance from the mutated amino acids to DNA in each mutant series. Distance 1 and Distance 2 are the minimum distances, in angstroms, from a mutation site to the C3 atoms of the deoxyribonucleotides on each of the two DNA strands. The color of each point represents the off-target rate, with deeper reds indicating higher rates and blues indicating lower rates. The two shades of green in the quarter-circles mark the 20 Å and 32 Å ranges, respectively.
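The distance calculation behind Fig.2 can be sketched as follows. This is a hedged illustration rather than our analysis script: the function names and coordinates are hypothetical, a residue is reduced to a single representative point, and the layer assignment assumes the quarter-circles are drawn in the (Distance 1, Distance 2) plane, so that a point's layer follows from its radial coordinate.

```python
import math


def min_distance(point, atoms):
    """Minimum Euclidean distance (Å) from one point to a list of atoms."""
    return min(math.dist(point, atom) for atom in atoms)


def distance_layer(radius, inner=20.0, outer=32.0):
    """Assign a layer from the 20 Å and 32 Å radii used in Fig.2."""
    if radius < inner:
        return "first layer"
    if radius <= outer:
        return "second layer"
    return "distal"


def classify_site(residue_xyz, strand1_atoms, strand2_atoms):
    """Return (Distance 1, Distance 2, layer) for one mutation site."""
    d1 = min_distance(residue_xyz, strand1_atoms)
    d2 = min_distance(residue_xyz, strand2_atoms)
    # Assumption: the layer is decided by the radius in the (d1, d2) plane.
    return d1, d2, distance_layer(math.hypot(d1, d2))


# Toy coordinates for illustration only.
residue = (0.0, 0.0, 0.0)
strand1 = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
strand2 = [(0.0, 0.0, 12.0), (0.0, 20.0, 0.0)]
print(classify_site(residue, strand1, strand2))
```

With real data, the atom lists would hold the C3 atom coordinates of each DNA strand taken from the predicted ternary complex.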

Because the MAD7 protein does not yet have a definitive crystal structure, we referred to the ternary complex structures of two similar proteins (PDB entries 6I1K and 5MGA) and predicted a high-confidence ternary complex structure with AlphaFold3^[1] to analyze how the mutations act. We focused on mutant No. 6, the best performer in the wet lab experiments. This mutant contains two mutation sites, K739T and R1062H. R1062H lies on the edge of the protein and, based on the analysis above, is inferred to have minimal impact, so our analysis concentrated on K739T.

We used PyMOL for point mutation analysis (Fig.3), which showed from several angles how the K739T mutation affects binding. First, in terms of amino acid properties, the mutation directly removes a hydrogen bond between the protein and DNA. Lysine (K) is a basic amino acid that is positively charged at physiological pH and binds the negatively charged DNA backbone through electrostatic interactions and hydrogen bonding. Threonine (T), in contrast, is a neutral, polar amino acid that carries no charge at physiological pH, and its hydroxyl group forms hydrogen bonds far less effectively than lysine does. The K739T mutation therefore reduces the electrostatic attraction between the protein and DNA and weakens the binding of the MAD7 enzyme to DNA. Off-target editing arises when the gRNA pairs with DNA that is not strictly complementary. With weaker protein-DNA binding, the enzyme can no longer recognize a target whose pairing with the gRNA is imperfect, which lowers the off-target rate.

Second, changes in the hydrogen bond network are also significant. The K739T mutation markedly reduces the number of hydrogen bonds between residue 739 and both the DNA and the surrounding residues, further weakening the binding of the MAD7 enzyme to DNA. The structural and conformational changes induced by the mutation may also alter the shape of the active site and the DNA binding region, leading to a looser protein structure and a reduced off-target rate. Overall, by directly changing the charge and polarity of the residue, the K739T mutation significantly improves sequence specificity and greatly reduces the off-target rate.

Fig.3 LYS-739 and THR-739. Spatial conformation of residue 739 before mutation (Lys) and after mutation (Thr).

2.2 High-Precision MAD7 Mutation Site Prediction

After interpreting the wet lab experiments, we used the SaprotHub platform for zero-shot mutation effect prediction^[2]. Using its protein language model and user-friendly workflow, we scored and analyzed mutations at every site of the MAD7 protein. The score assesses the impact of an individual mutation site on the whole protein structure and characterizes the protein's evolutionary potential and trends. To identify hotspot regions, we analyzed the relationship between the distance of each mutation site from DNA and its average predicted score (Fig.4). The distribution of averages shows that most sites have minimal impact on the protein. The red dots in the figure mark the mutation sites from the experiment. Most data points lie on the "average = 0" line and in the dark-colored areas, indicating that these mutations have limited overall impact on the protein. Low off-target sites are concentrated at points that affect the protein moderately, with averages between -5 and -2. As inferred in 2.1, the further a mutation site is from DNA, the smaller its impact on the off-target rate. We therefore focused on the region with average values of -5 to -3 and distances from DNA of 30 to 40 Å, marked by the yellow circle in the figure, because we believe sites there can influence protein function while remaining relatively stable. Notably, site 739 also falls within this region, a result we had not anticipated but that is consistent with the analysis in 2.1.

Fig.4 Mean Score vs Minimum DNA Distance with Variance Gradient. The horizontal axis shows the average predicted score of each site. The vertical axis shows the minimum distance, in angstroms, from the site to the C3 atoms of the deoxyribonucleotides on both DNA strands. The color intensity of each point indicates the variance of the mutation scores at that site, with darker shades representing higher variance. Red points highlight experimental mutation sites: deep red for off-target rates below 0.1, light red for rates between 0.1 and 0.5, and bright red for rates above 0.5.
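The site filtering described above can be sketched as follows. This is a minimal illustration with hypothetical names and toy numbers: it assumes the per-site predictions are available as a list of scores for each residue position and that a distance to DNA has been computed for every position.

```python
from statistics import mean, pvariance


def site_summary(scores_by_site):
    """Return {site: (mean score, population variance)} over all substitutions."""
    return {site: (mean(v), pvariance(v)) for site, v in scores_by_site.items()}


def candidate_sites(summary, distance_by_site,
                    score_range=(-5.0, -3.0), dist_range=(30.0, 40.0)):
    """Sites whose mean score and DNA distance both fall inside the window."""
    s_lo, s_hi = score_range
    d_lo, d_hi = dist_range
    return sorted(
        site
        for site, (avg, _var) in summary.items()
        if s_lo <= avg <= s_hi and d_lo <= distance_by_site[site] <= d_hi
    )


# Toy scores and distances for illustration only (site numbers are made up).
scores = {10: [-4.0, -3.0, -5.0], 20: [0.0, 0.1, -0.1], 30: [-4.0, -4.0]}
distances = {10: 35.0, 20: 35.0, 30: 10.0}
print(candidate_sites(site_summary(scores), distances))
```

The default window matches the region described above (averages of -5 to -3, distances of 30 to 40 Å); both ranges are parameters so that other windows can be explored.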

Part Three Predictive Design of Mutation Sites Affecting sacB Toxicity

Abstract

The sacB gene encodes a fructosyltransferase that catalyzes the hydrolysis of sucrose to produce levan, which, when accumulated in Gram-negative bacteria, can lead to their death^[3]. To develop a more versatile sacB screening toolkit with graded lethality, we predicted single and combined mutation sites affecting sacB toxicity through molecular docking simulations and FuncLib. Integrating the predictions from both approaches, we proposed the sacB_S164A and sacB_E262Q mutants, which have different toxicity levels, and validated their toxicity in wet lab experiments. Together with sacB_S164T and sacB_WT, proposed by the DUT_China team in 2022, these two mutants enrich our sacB toolkit with a range of lethality rates.

3.1 Molecular Docking Predicts Different Toxicity of sacB Mutation Sites

When bacteria grow in sucrose-containing media, the SacB enzyme converts sucrose to levan, a high-molecular-weight polysaccharide that accumulates in the outer membrane of bacterial cells. In Gram-negative bacteria, this accumulation impairs cell membrane function and is therefore toxic. To identify the sites critical for sucrose binding, we docked sucrose to the fructosyltransferase using AutoDock Vina^[4][5], taking the sucrose structure from the PubChem database^[6] and the protein structure from the PDB database, and ran the docking program in a Cygwin64 environment. We optimized the size and position of the docking box to improve the accuracy and precision of the docking simulations.

We selected 15 docking results that had high affinity and resembled real crystal structures, counted the polar interactions and hydrogen bonds formed between sacB residues and the ligand, and ranked the residues by these counts (Fig.5). From this ranking, we identified 13 critical sites for substrate binding: 164, 342, 86, 85, 163, 246, 360, 412, 433, 411, 343, 247, 262. Based on the literature and crystal structures, these sites are presumed to significantly affect the catalytic activity of fructosyltransferase. We selected 10 sites that intersect with the literature for sacB mutation: 86, 164, 246, 247, 262, 340, 342, 343, 360, 411. For some of these sites, such as 86, 164, 247, 340, and 342, published mutation data can serve as evidence for our experimental results^[6].

Fig.5 The Number of Polar Interactions and Hydrogen Bonds
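The counting and ranking behind Fig.5 can be sketched as follows. The sketch assumes each docking pose has already been reduced to a list of residue numbers, with one entry per polar interaction or hydrogen bond to the ligand; the function names and the toy poses are hypothetical.

```python
from collections import Counter


def rank_contact_residues(poses):
    """Rank residues by their total number of contacts across all poses.

    poses: list of lists of residue numbers, one entry per contact.
    Returns (residue, count) pairs, most frequent first.
    """
    counts = Counter()
    for contacts in poses:
        counts.update(contacts)
    return counts.most_common()


def shared_sites(docking_sites, literature_sites):
    """Sites supported by both docking and the literature, sorted."""
    return sorted(set(docking_sites) & set(literature_sites))


# Toy poses for illustration only.
toy_poses = [[164, 164, 342], [164, 86], [342]]
ranking = rank_contact_residues(toy_poses)
print(ranking)
print(shared_sites([site for site, _ in ranking], [86, 164, 999]))
```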

3.2 Using FuncLib to Predict Combined Mutations of sacB

Based on this analysis, we used FuncLib^[7], an automated multi-point mutation design method based on phylogenetic analysis and Rosetta design calculations, to search the mutation space of these 10 high-impact sites and to predict the stability of the enzyme and its affinity for fructose receptors after multiple mutations. To account for both stability and affinity after mutation, we scored and ranked the results with the formulas below, which negate the effects of sign and absolute magnitude. We quantified post-mutation stability, binding ability with small molecules, and overall capacity as follows:

- Formula for Stability Calculation

- Formula for Binding Ability Calculation

- Overall Formula

Through this method, we selected the mutation combinations that surpassed the wild type in stability, small molecule binding ability, and overall capacity. We then quantified, for each individual site, the proportion of each mutation among the combinations superior to the wild type. The results agree with mutation effects described in the literature, such as 86: S>N>T>D and 247: D>N. We also tested the new sites 262 and 164 and experimentally validated the E262Q and S164A mutations.

Fig.6 Overall Scoring Chart. The x-axis represents the mutation sites, while the y-axis displays the count of specific mutations at each site in mutation combinations that outperform the wild type. The frequency of each mutation is measured across the spectrum of tested combinations, providing a quantitative assessment of mutational impacts relative to the baseline wild type performance.

Fig.7 Stability Scoring Chart. The horizontal axis displays the mutation sites, while the vertical axis indicates the count of specific mutations at each site across mutation combinations that demonstrate greater stability than the wild type. This chart quantitatively assesses the stabilizing effects of mutations, highlighting those that enhance protein stability beyond the baseline wild type characteristics.

Fig.8 Small Molecule Binding Ability Scoring Chart. The horizontal axis displays the mutation sites, while the vertical axis indicates the count of specific mutations at each site across mutation combinations that demonstrate greater small molecule binding ability than the wild type. This chart quantitatively assesses the binding capacity enhancements of mutations, highlighting those that improve protein interaction with small molecules beyond the baseline wild type characteristics.
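The per-site counting shown in Fig.6 to Fig.8 can be sketched as follows. This is an illustration under stated assumptions, not our scoring pipeline: the names and toy scores are hypothetical, each design lists the residue at every designed site, and a higher score is taken to mean better than the wild type.

```python
from collections import Counter, defaultdict


def count_winning_mutations(designs, wt_score):
    """Count each residue per site over the designs that beat the wild type.

    designs: list of (residues, score), where residues maps site -> amino acid.
    Assumption: a higher score is better.
    """
    counts = defaultdict(Counter)
    for residues, score in designs:
        if score > wt_score:
            for site, amino_acid in residues.items():
                counts[site][amino_acid] += 1
    return counts


def ranking_string(counter):
    """Format a site's counts as a ranking such as 'A>S'."""
    return ">".join(amino_acid for amino_acid, _ in counter.most_common())


# Toy designs and scores for illustration only.
toy_designs = [
    ({164: "A", 262: "Q"}, 1.20),
    ({164: "A", 262: "E"}, 1.10),
    ({164: "S", 262: "Q"}, 1.05),
    ({164: "T", 262: "E"}, 0.90),
]
counts = count_winning_mutations(toy_designs, wt_score=1.0)
for site in sorted(counts):
    print(site, ranking_string(counts[site]))
```

Running the same count on the stability score, the binding score, and the overall score in turn would give three rankings per site, which is the form in which the results are listed.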

Here are the final results:

**Overall Scores:** (Fig.6)
- 86: D>N>S>T
- 164: A>S>T
- 246: Q>K>R
- 247: D>N
- 262: Q>E
- 340: S>I>V>N>D>E>Q
- 342: S>A>Q>E
- 343: R>H>Q>N>K
- 360: S>T>K>N>Q>R>F>E>H
- 411: H>Y

**Stability:** (Fig.7)
- 86: S>N>T>D
- 164: A>S>T
- 246: Q>R>K
- 247: D>N
- 262: Q>E
- 340: S>Q>E>I>V>N>D
- 342: E>S>Q>A
- 343: R>N>Q>H>K
- 360: S>E>T>Q>F>K>Y>N>R>>H
- 411: Y>H

**Binding Ability:** (Fig.8)
- 86: D>>N>>S>T
- 164: A>S>T
- 246: K>>R>Q
- 247: D>>N
- 262: Q>E
- 340: I>N>V>S>D>>E>Q
- 342: A>Q>S>E
- 343: H>Q>R>N>>K
- 360: K>S>T>N>R>H>F>Q>Y>E
- 411: H>>Y

Finally, we established a toxicity gradient model: WT>E262Q>S164T>S164A (Fig.9).


Reference

[1] Abramson, J., Adler, J., Dunger, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024). https://doi.org/10.1038/s41586-024-07487-w
[2] Jin Su, Zhikai Li, Chenchen Han, Yuyang Zhou, Yan He, Junjie Shan, Xibin Zhou, Xing Chang, Dacheng Ma, The OPMC, Martin Steinegger, Sergey Ovchinnikov, Fajie Yuan. SaprotHub: Making Protein Modeling Accessible to All Biologists. bioRxiv 2024.05.24.595648 (2024). https://doi.org/10.1101/2024.05.24.595648
[3] Huang, H., Huang, G., Tan, Z. et al. Engineered Cas12a-Plus nuclease enables gene editing with enhanced activity and specificity. BMC Biol 20, 91 (2022). https://doi.org/10.1186/s12915-022-01296-1
[4] Eberhardt, J., Santos-Martins, D., Tillack, A.F., Forli, S. AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. Journal of Chemical Information and Modeling (2021).
[5] Trott, O., Olson, A.J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry 31(2), 455-461 (2010).