A promoter is a critical regulatory region of DNA located upstream of a gene that controls its transcription. It serves as the binding site for RNA polymerase and other transcription factors, determining the transcription start site and transcription efficiency. Promoters typically contain specific cis-regulatory elements, such as the TATA box, CAAT box, and GC box, which interact with transcription factors to influence gene expression levels.
In previous studies, machine learning-based prediction of promoter transcription levels from sequence has been explored extensively. However, there is a lack of research focused specifically on predictions for mutated promoters. For instance, constructing a predictor for the E. coli tac promoter and its mutants, along with a corresponding generator that designs mutants with specified expression levels, has remained largely unstudied. This research aims to address that gap.
We first generated a dataset of tac promoter mutants through random mutagenesis, then combined a CNN (Convolutional Neural Network) and an LSTM (Long Short-Term Memory) network to build a sequence-based predictor. In parallel, we employed a GA (Genetic Algorithm) to design a generator, with the CNN-LSTM predictor integrated as the GA's evaluation module. Using the previously obtained tac mutant sequences as the initial population, we applied crossover, mutation, and related operations, with fitness evaluated by the predictor, to design a new batch of tac mutant promoters. After experimental validation, we replaced the generator with a more interpretable statistical model based on the binomial distribution; using confidence intervals and hypothesis testing, we identified high-frequency base-position combinations of strong promoters, completing our three-cycle design-test-learn loop.
We decided to use a combination of Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks to train a predictor on our dataset. Since the dataset is relatively small, we implemented several methods to reduce the possibility of overfitting.
Data Preprocessing: The sequence data is first transformed into numerical encoding using the `encode_sequence` function, where the DNA bases (A, C, G, T) are mapped to 0, 1, 2, and 3, respectively. We did not use one-hot encoding here because it represents each base as an independent binary vector, which significantly increases the feature dimension; the larger feature space makes the model more prone to fitting noise in the training set, raising the risk of overfitting.
Normalization: We use Z-Score normalization to normalize the data, reducing the impact of extreme values.
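To make the preprocessing concrete, here is a minimal sketch, assuming equal-length sequences; the `encode_sequence` name comes from the text, while the helper structure and the example data are our assumptions.

```python
# Minimal sketch of the preprocessing steps described above.
import numpy as np

BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_sequence(seq):
    """Map a DNA string to an integer vector (A, C, G, T -> 0, 1, 2, 3)."""
    return np.array([BASE_TO_INT[b] for b in seq.upper()], dtype=np.float32)

def zscore(values):
    """Z-score normalization of the expression labels to damp extreme values."""
    values = np.asarray(values, dtype=np.float32)
    return (values - values.mean()) / values.std()

# Example: encode a batch of equal-length promoter sequences (placeholders)
sequences = ["ACGTACGT", "TTGACAGC"]
X = np.stack([encode_sequence(s) for s in sequences])
y = zscore([1200.0, 30279.0])   # placeholder expression levels
```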
Next is the core part of our CNN-LSTM model, which we explain layer by layer below; a code sketch of the full architecture and training setup follows the layer descriptions.
Convolutional Layer: Conv1D is used to extract local features from the sequence. We set 32 filters with a kernel size of 3, and the activation function is ReLU, which introduces non-linearity. The convolution operation captures local patterns from adjacent positions in the sequence (e.g., triplets in a DNA sequence).
Pooling Layer: Pooling reduces the dimensionality of features, lowers computational costs, and helps prevent overfitting. The pooling window size is 3, meaning the maximum value is taken from every three feature values.
Dropout Layer: Randomly drops 50% of the neurons during training so the model does not rely too heavily on specific features, which further prevents overfitting and improves generalization.
Batch Normalization Layer: Batch normalization is used to accelerate training and stabilize the learning process by normalizing the input and maintaining a stable output distribution.
LSTM Layers: Two LSTM layers are used to handle sequence data. LSTM has a memory mechanism that captures long-term dependencies in sequences. The first LSTM layer outputs the sequence for all time steps (`return_sequences=True`), while the second LSTM layer outputs only the state of the last time step, providing input to the subsequent fully connected layers. LSTM can capture positional dependencies in DNA sequences, such as relationships between bases at different positions in the promoter region.
Flatten Layer: The output from the previous LSTM layer is flattened into a vector for processing by the fully connected layers.
Fully Connected Layers: The first dense layer (`Dense(32)`) is fully connected with 32 neurons to further extract features, using ReLU as the activation function. The final layer (`Dense(1)`) has a single-neuron output with a linear activation function (`activation='linear'`), producing the final prediction value, which is the promoter strength.
Given the limited amount of data, we used various regularization techniques (e.g., Dropout, Batch Normalization, and Early Stopping) to prevent overfitting and conducted multiple training sessions to find the optimal model.
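The sketch below assembles this architecture in Keras. The filter count, kernel and pool sizes, dropout rate, the two-LSTM arrangement, and the use of early stopping follow the description above; the LSTM unit counts, optimizer, sequence length, epochs, and patience are our assumptions.

```python
# Sketch of the CNN-LSTM predictor described layer by layer above.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Dropout,
                                     BatchNormalization, LSTM, Flatten, Dense)
from tensorflow.keras.callbacks import EarlyStopping

def build_model(seq_len):
    model = Sequential([
        # Local features from adjacent bases (e.g., triplets): 32 filters, kernel 3
        Conv1D(32, kernel_size=3, activation="relu", input_shape=(seq_len, 1)),
        MaxPooling1D(pool_size=3),        # max over every 3 feature values
        Dropout(0.5),                     # drop 50% of units to curb overfitting
        BatchNormalization(),             # stabilize the activation distribution
        LSTM(64, return_sequences=True),  # first LSTM returns all time steps
        LSTM(64),                         # second LSTM returns the last state
        Flatten(),
        Dense(32, activation="relu"),     # further feature extraction
        Dense(1, activation="linear"),    # predicted promoter strength
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

model = build_model(seq_len=100)  # assumed promoter length
early_stop = EarlyStopping(monitor="val_loss", patience=10,
                           restore_best_weights=True)
# Integer-encoded X needs a trailing channel axis for Conv1D:
# model.fit(X[..., None], y, validation_split=0.2, epochs=200,
#           callbacks=[early_stop])
```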
Final Model Performance on the test set:
These results significantly outperform the predictor reported by Wang et al. in Nucleic Acids Research in 2018 (PCC = 0.514).
It is worth mentioning that one particular data point caught our attention. As shown in the left image, the highest value in the original dataset, a mutated tac promoter with an expression level exceeding 40,000, was consistently predicted by our model to be around 30,000 across multiple training runs. Initially, we thought this was due to the model not fitting well. However, in the final round of experimental validation, the protein expression level for this promoter was measured at 30,279, indicating that the previous value of over 40,000 was erroneous. This aligns closely with our model's prediction of 30,000, demonstrating that the model was able to identify a hidden data anomaly. This is undoubtedly exciting and further confirms the effectiveness of our model.
Initially, we reasoned that since we are optimizing DNA sequences such as promoters, a genetic algorithm would be a natural fit. Genetic algorithms are heuristic search methods that mimic the mechanisms of biological evolution. They optimize the predicted strength of promoter sequences through the following steps (a compact code sketch follows these steps):
The population size is set to `population_size`, and the initial population is derived from the provided original mutated tac promoter sequences (passed in as `initial_population`). Each individual in the population (a promoter sequence) undergoes the subsequent evolutionary process.
The predicted transcription strength of each promoter sequence is evaluated using the pre-built CNN-LSTM predictor and serves as the fitness score, referred to as `fitness_scores`. The code records and outputs the best fitness value for each generation.
Based on the fitness scores, the best-performing individuals (promoter sequences with high fitness scores) are selected, retaining the top 50% of the population. The selection uses `np.argsort(fitness_scores)` to find the indices of the sequences with the highest fitness.
From the selected sequences, two parents are randomly chosen. A crossover point is used to combine segments from both parents, generating two new offspring sequences. This crossover process mimics biological genetic crossing by incorporating segments from parent sequences into the offspring, thus producing new individuals.
A mutation operation randomly selects a position in a sequence with a 10% probability and mutates the base at that position to another randomly chosen base. The mutation operation helps maintain the diversity of the population and prevents it from getting trapped in local optima.
The offspring generated through crossover and mutation are added to the selected parent sequences, forming the next generation of the population. This process repeats until the set number of generations (`generations`) is reached.
Once all generations (e.g., 50 generations) are completed, the algorithm evaluates all sequences in the final population and selects the top 50 promoter sequences along with their predicted transcription strengths as the final results of the genetic algorithm.
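The full loop can be sketched as follows, reusing the `encode_sequence` helper from earlier. The population split, the 10% mutation rate, the 50-generation default, and the top-50 selection follow the text; the pairing of parents, the exact mutation scheme (here, one random position per sequence at a 10% chance), and the print format are our assumptions.

```python
# Compact sketch of the genetic algorithm described above; `predictor` is the
# trained CNN-LSTM model.
import random
import numpy as np

BASES = "ACGT"

def fitness(predictor, population):
    """Predicted transcription strength from the CNN-LSTM predictor."""
    X = np.stack([encode_sequence(s) for s in population])[..., None]
    return predictor.predict(X, verbose=0).ravel()

def crossover(p1, p2):
    """Single-point crossover producing two offspring."""
    point = random.randrange(1, len(p1))
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(seq, rate=0.1):
    """With 10% probability, change one random position to another base."""
    if random.random() < rate:
        i = random.randrange(len(seq))
        seq = seq[:i] + random.choice(BASES.replace(seq[i], "")) + seq[i + 1:]
    return seq

def run_ga(predictor, initial_population, generations=50):
    population = list(initial_population)
    for gen in range(generations):
        fitness_scores = fitness(predictor, population)
        order = np.argsort(fitness_scores)[::-1]         # highest fitness first
        parents = [population[i] for i in order[: len(population) // 2]]
        offspring = []
        while len(parents) + len(offspring) < len(population):
            c1, c2 = crossover(*random.sample(parents, 2))
            offspring += [mutate(c1), mutate(c2)]
        population = parents + offspring[: len(population) - len(parents)]
        print(f"generation {gen}: best fitness {fitness_scores[order[0]]:.1f}")
    final_scores = fitness(predictor, population)
    best = np.argsort(final_scores)[::-1][:50]           # top 50 sequences
    return [(population[i], final_scores[i]) for i in best]
```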
By simulating natural selection and the evolutionary process, genetic algorithms are very effective for complex problems (like optimizing DNA sequences), especially when the search space is large and finding a global optimum through traditional optimization methods is difficult. Through crossover and mutation, they explore different sequence combinations, gradually approaching the optimal promoter sequences.
To validate the generator model, we also generated poorer promoter sequences in reverse. However, the experimental results indicated that while a portion of the sequences produced by our GA generator matched expectations, the overall agreement was still unsatisfactory and the approach lacked interpretability. Therefore, we decided to modify the generator.
In studies on Corynebacterium glutamicum and Escherichia coli, a formula was developed to analyze the relationship between promoter sequences and their expression strength based on a binomial distribution model. This model is used to assess the position of each nucleotide in the promoter and its impact on promoter strength.
The binomial distribution model calculates whether the occurrence of a nucleotide at a specific position exceeds the value expected under a random distribution. For instance, if the probability of adenine (A) appearing at a given position is expected to be *p*, and there are *q* promoter sequences, the probability of this nucleotide appearing *x* times at that position can be calculated using the binomial distribution. If the observed frequency is significantly higher than expected, it can be inferred that the nucleotide at that position may have a significant influence on promoter strength. This approach is a hypothesis testing method that incorporates confidence intervals.
Using these probabilities, nucleotides that frequently or infrequently appear in strong or weak promoters can be identified, and this information can be used to design new promoters.
In math terms, the probability of observing *x* or more occurrences is

$$Pr = P(X \ge x) = \sum_{k=x}^{q} \binom{q}{k} p^{k} (1-p)^{q-k}$$

where *x* represents the number of occurrences of a specific nucleotide at a given position, *q* the total sample size, and *p* the probability of that nucleotide appearing at the given position. By calculating this probability for each nucleotide at each position, the key nucleotide sites related to promoter strength can be identified.
In practical applications, promoters are divided into two categories (strong and weak), with several promoter sequences in each category. The probability *Pr* is calculated separately for nucleotides in strong and in weak promoters.
An unusually large or small *Pr* value indicates that the nucleotide-position combination is significantly associated with promoter strength. For example, if the *Pr* value for a nucleotide-position combination calculated over strong promoters is high, the combination occurs relatively infrequently in strong promoters, whereas a low *Pr* value indicates it occurs relatively frequently. The key point is that *Pr* does not represent the actual occurrence frequency but the probability, under the null model, of observing *x* or more occurrences of the nucleotide.
We optimized this model by dividing promoters into three categories: strong, medium, and weak. The *q* value still represents the total sample size, but medium-strength promoters are excluded from the *Pr* calculation, enhancing the distinction between strong and weak promoters.
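A minimal sketch of this tail probability using `scipy.stats.binom`; the names *x*, *q*, and *p* follow the text, while the example numbers are invented.

```python
# Upper-tail binomial probability Pr used to flag base-position combinations.
from scipy.stats import binom

def position_pr(x, q, p=0.25):
    """Probability of seeing x or more occurrences of a base at a position
    among q promoters, if the base occurred at random with probability p."""
    return binom.sf(x - 1, q, p)   # sf(k) = P(X > k), so sf(x-1) = P(X >= x)

# Example: adenine observed 30 times at one position across q = 60 strong
# promoters; a very small Pr flags the combination as significantly frequent.
print(position_pr(x=30, q=60, p=0.25))
```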
The mutants generated using this model ultimately achieved an expression level of 108% of the tac promoter in experiments. Considering that the predictor employed Z-score normalization to mitigate the effects of extreme values, the predicted expression range was relatively conservative, making it unlikely to predict mutants with significantly higher expression than the tac promoter. Nonetheless, the result aligns with expectations and is quite promising.
Signal peptides, typically composed of 20-30 amino acids, are found predominantly at the N-terminus of secretory proteins, although C-terminal signal peptides also exist. Based on amino acid composition and structure, they can be divided into three regions: the N, H, and C regions. The N region is rich in positively charged amino acids such as arginine and lysine, while the H region is a hydrophobic stretch that often forms an alpha-helix. The hydrophobicity of the H region determines which molecular chaperones the signal peptide binds, thereby influencing the secretion pathway; leucine and alanine are common in this region. The C region is hydrophilic and contains the cleavage site, with specific residues recognized by signal peptidases: typically small residues at the -3 and -1 positions, following the A-X-A rule.
Despite extensive research on signal peptides, models specifically targeting signal peptides of Gram-negative bacteria remain rare. These peptides typically have shorter sequences, which has to some extent restricted research and model development in this field. This study builds on an existing RF regression model developed for signal peptides in Bacillus subtilis [1]. Initially, we predicted the secretion potential of 194 Escherichia coli signal peptides (http://www.signalpeptide.de/) using the original model. From these, we selected 40 signal peptides spanning a gradient of predicted secretion strengths and constructed them with ICCG and GFP for expression. Fluorescence intensity measurements were carried out to evaluate enzyme expression levels. Notably, the signal peptide nfaa, which was predicted to have the highest expression, indeed showed the highest experimental expression. However, due to the model's limited transferability, the overall accuracy was not completely satisfactory, although it still offered useful guidance for the experiments.
Subsequently, we modified the H region by generating new sequences using a comprehensive scoring matrix that integrated the BLOSUM62 substitution matrix, a hydrophobicity matrix, and a Markov transition frequency matrix derived from Gram-positive bacteria [2]. Unfortunately, the experimental results did not exceed those of the wild-type NFAA. We then generated 60 new signal peptides using the same matrix, scored them with the original model, and selected those with higher predicted scores than the wild-type NFAA for experimental verification. The results were still not optimal. We believe this may be attributable to the Markov transition frequency matrix, which was derived from Gram-positive bacteria and may not suit our target organism. Additionally, the RF regression model developed for Gram-positive bacteria showed poor sensitivity (±0.2) when applied to Gram-negative bacteria. Finally, we retrained the original signal peptide prediction model on our own dataset, completing three rounds of the design-test-learn cycle.
In the original model, we first used PSORT and SignalP 6.0 to delimit the N-, H-, and C-regions, plus a region referred to as "after cleavage" (Ac-region). From each region, 156 feature values were then extracted, including Length, MW, Charge, ChargeDensity, pI, InstabilityInd, Aromaticity, AliphaticInd, and BomanInd.
In this study, various bioinformatics tools were used to extract critical features from signal peptides. Using Biopython's `Bio.SeqUtils.ProtParam.ProteinAnalysis.secondary_structure_fraction()`, we obtained the fraction of amino acids that tend to form alpha-helices, turns, and beta-sheets. Additionally, `Bio.SeqUtils.ProtParam.ProteinAnalysis.get_amino_acids_percent()` was employed to calculate the relative frequency of each amino acid.
We further utilized the `GlobalDescriptor` class from the modlAMP Python package with the `calculate_all()` method (pH = 7.0, amide = False) to obtain values for Length, Molecular Weight (MW), Charge, Charge Density, Isoelectric Point (pI), Instability Index, Aromaticity, Aliphatic Index, Boman Index, and Hydrophobic Ratio.
The `PeptideDescriptor` class in modlAMP with the `calculate_global()` method (`modality='mean'`) provided values for the argos, eisenberg, ez, flexibility, gravy, hopp-woods, janin, and kyte-doolittle scales. RNA secondary structure propensities, expressed as minimum free energy (kcal/mol), were computed using the RNA folding method from the ViennaRNA package. Additionally, a Boolean dummy variable was used to encode the amino acid at the -3 position of the AxA consensus cleavage site.
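The sketch below strings the named calls together. The example peptide and coding sequence are invented, and how the project concatenates the features into one vector is our assumption.

```python
# Feature extraction with Biopython, modlAMP and ViennaRNA, as named above.
from Bio.SeqUtils.ProtParam import ProteinAnalysis
from modlamp.descriptors import GlobalDescriptor, PeptideDescriptor
import RNA

peptide = "MKKTAIAIAVALAGFATVAQA"   # example signal peptide (PelB-like)

# Secondary-structure propensities and amino-acid frequencies (Biopython)
pa = ProteinAnalysis(peptide)
helix, turn, sheet = pa.secondary_structure_fraction()
aa_percent = pa.get_amino_acids_percent()

# Global physicochemical descriptors (modlAMP), pH 7.0, free C-terminus
gd = GlobalDescriptor([peptide])
gd.calculate_all(ph=7.0, amide=False)   # Length, MW, Charge, pI, Boman, ...
global_features = gd.descriptor[0]

# Mean value of the Eisenberg hydrophobicity scale (modlAMP)
pd = PeptideDescriptor([peptide], "eisenberg")
pd.calculate_global(modality="mean")
eisenberg_mean = pd.descriptor[0][0]

# Boolean dummy for the -3 position of the AxA cleavage consensus (assumed
# to mean: is the residue three positions before the cleavage site an Ala?)
minus3_is_ala = int(peptide[-3] == "A")

# Minimum free energy of the mRNA secondary structure (ViennaRNA)
mrna = "ATGAAAAAGACAGCTATCGCG"          # example coding sequence
structure, mfe = RNA.fold(mrna.replace("T", "U"))
```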
RSCU is "the observed frequency of [a] codon divided by the frequency expected under the assumption of equal usage of the synonymous codons for an amino acid." In math terms, it is

$$\mathrm{RSCU}_{ij} = \frac{x_{ij}}{\frac{1}{n_i} \sum_{k=1}^{n_i} x_{ik}}$$

where $x_{ij}$ is the number of occurrences of the $j$-th codon for the $i$-th amino acid, and $n_i$ is the number (from one to six) of alternative codons for the $i$-th amino acid.
CAI is "the geometric mean of the RSCU values... corresponding to each of the codons used in that gene, divided by the maximum possible CAI for a gene of the same amino acid composition." In math terms, it is

$$\mathrm{CAI} = \left( \prod_{k=1}^{L} w_k \right)^{1/L}, \qquad w_k = \frac{\mathrm{RSCU}_k}{\mathrm{RSCU}_{k,\max}}$$

where $L$ is the number of codons in the gene and $w_k$ is the relative adaptiveness of the $k$-th codon.
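A small worked sketch of these two definitions, restricted to a single amino acid's codon family with invented counts for clarity.

```python
# RSCU and CAI computed for one synonymous codon family (alanine, n = 4).
import numpy as np

codon_counts = {"GCT": 40, "GCC": 25, "GCA": 20, "GCG": 15}  # invented counts

n = len(codon_counts)                        # number of synonymous codons
total = sum(codon_counts.values())
rscu = {c: x / (total / n) for c, x in codon_counts.items()}

# Relative adaptiveness w = RSCU / max(RSCU); CAI is the geometric mean of
# the w values of the codons actually used in a gene.
w = {c: r / max(rscu.values()) for c, r in rscu.items()}
gene_codons = ["GCT", "GCC", "GCT", "GCG"]   # toy gene
cai = np.exp(np.mean([np.log(w[c]) for c in gene_codons]))
print(rscu, cai)
```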
The Random Forest regression model was evaluated with several metrics. In this experiment, MAE (Mean Absolute Error) was used to evaluate the overall error without giving more weight to larger errors, providing a robust measure of prediction error; it quantifies the average absolute difference between predicted and actual values. In math terms, it is

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

RMSE (Root Mean Squared Error) was used alongside MAE and penalizes larger errors more heavily. In math terms, it is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$
In this experiment, R² was used to evaluate the model's ability to explain the variability of the data, with values closer to 1 indicating a better fit. R² indicates the proportion of variance in the dependent variable that is predictable from the independent variables. In math terms, it is

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Information Gain (IG) was used to measure how well a feature, when used to split the data, reduces uncertainty (entropy). In this experiment, IG was crucial for selecting the most informative features during tree construction in the Random Forest. In math terms, it is

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$

where $H(S)$ is the entropy of set $S$ and $S_v$ is the subset of $S$ for which feature $A$ takes value $v$.
In this experiment, we used the Gini Index, a measure of impurity or diversity in the data, to choose the best splits in decision trees; lower values indicate more homogeneous groups. In math terms, it is

$$Gini(S) = 1 - \sum_{i=1}^{C} p_i^2$$

where $p_i$ is the proportion of samples belonging to class $i$.
In this experiment, entropy was utilized to help the model decide optimal splits during tree growth by minimizing uncertainty. In math terms, it is

$$H(S) = - \sum_{i=1}^{C} p_i \log_2 p_i$$
The final prediction in Random Forest was the result of aggregating predictions from all trees. In this experiment, we calculated the average prediction for regression tasks or used majority voting for classification tasks to generate robust outcomes.
In math terms, it is

$$\hat{y}(x) = \frac{1}{T} \sum_{t=1}^{T} f_t(x)$$

where $T$ is the number of trees and $f_t(x)$ is the prediction from tree $t$. In this experiment, we used accuracy to evaluate the model's effectiveness in correctly classifying instances into their respective categories. In math terms, it is

$$\mathrm{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$
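For reference, the regression metrics above can be computed with scikit-learn; this is a minimal sketch with placeholder arrays, not the project's evaluation script.

```python
# Regression metrics (MAE, RMSE, R², PCC) on placeholder predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([0.8, 1.2, 0.5, 1.9])
y_pred = np.array([0.9, 1.1, 0.7, 1.7])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
pcc = np.corrcoef(y_true, y_pred)[0, 1]   # Pearson correlation coefficient
print(mae, rmse, r2, pcc)
```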
We selected the top-performing signal peptides and fused them with ICCG and green fluorescent protein (GFP). These constructs were cultivated in 96-well plates, and green fluorescence intensity was measured with a microplate reader to evaluate expression levels. The experimental outcomes were highly consistent with the model predictions. However, because the RF regression model trained on Gram-positive bacteria showed poor sensitivity (±0.2) when applied to Gram-negative bacteria, we optimized the model using the small dataset we had generated.
The features of the new signal peptides were extracted, scored, and then used to construct and culture new strains. These data were then incorporated into the dataset to train a more accurate Random Forest (RF) regression model, better reflecting the growth conditions of E. coli signal peptides.
Our Random Forest model is constructed using 1,024 decision trees, each trained through bootstrapped sampling of the data (bootstrap=True). The splitting criterion for the trees is the minimization of mean squared error, ensuring that at each split, the feature and threshold that result in the smallest prediction error are selected. The trees are allowed to grow up to a maximum depth of 256, balancing model complexity with overfitting prevention. Features are randomly selected at each split according to the square root of the total number of features, which enhances model robustness by reducing correlation among the trees. Additionally, the model employs a warm-start strategy, allowing for iterative refinement of the model by reusing previous results, and incorporates no restrictions on the number of leaf nodes or samples per split, enabling more flexibility in capturing data patterns. This configuration allows the model to efficiently handle complex, high-dimensional data while minimizing prediction errors and maintaining generalization performance.
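Expressed in scikit-learn, the stated configuration looks like the sketch below; every value shown comes from the description above, any unmentioned setting is left at its library default, and the fit call is illustrative.

```python
# The Random Forest configuration described above, as a scikit-learn model.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=1024,          # 1,024 decision trees
    criterion="squared_error",  # minimize mean squared error at each split
    max_depth=256,              # trees may grow up to depth 256
    max_features="sqrt",        # sqrt of the feature count tried per split
    bootstrap=True,             # bootstrapped sampling of the data
    warm_start=True,            # reuse previous trees for iterative refinement
    max_leaf_nodes=None,        # no restriction on the number of leaf nodes
    min_samples_split=2,        # no extra restriction on samples per split
)
# rf.fit(X_train, y_train); predictions = rf.predict(X_test)
```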
We found that, compared with the original RF regression model, the 20 most significant features of the new model had changed. This further indicates that signal peptides of Gram-negative bacteria differ substantially from those of Gram-positive bacteria, underscoring the importance of training the new model.
We fitted the new model to the original data and observed that, while the overall trend was correct, some deviations remained. These findings are consistent with our initial signal peptide screening results, indicating that the original experimental selection was justified.
The final training outcomes of the dataset are as follows:
These metrics indicate that the model exhibits both high accuracy and predictive capability. The low MAE and RMSE values suggest that the model effectively minimizes prediction errors, while the relatively high R² demonstrates its substantial capacity to explain the variance within the data. Additionally, the Pearson Correlation Coefficient of 0.8275 indicates a strong positive correlation between the predicted and actual values. Taken together, these results validate the model as a robust tool for predicting E. coli protein secretion efficiency. This innovative approach significantly advances a relatively underexplored field, opening new avenues for the study and development of E. coli protein secretion models, while offering valuable insights for future research.
For the most outstanding signal peptides from the original model, we modified the H-region based on the experimental and predictive data. Since the matrices involve different measurement scales, standardization is required to bring them to a common scale. The BLOSUM62 substitution matrix and the hydrophobicity matrix are standardized and combined according to the formula below.
$$f_{ij} = w_1 a_{ij} + w_2 b_{ij}$$

In Gram-negative bacteria, the weight of the BLOSUM62 matrix is set to 60% ($w_1 = 0.6$) and the weight of the hydrophobicity matrix to 40% ($w_2 = 0.4$). Here $a_{ij}$ and $b_{ij}$ respectively represent the elements in the $i$-th row and $j$-th column of the standardized BLOSUM62 and hydrophobicity matrices, and $f_{ij}$ represents the corresponding element of the comprehensive score matrix. We obtain a substitution score matrix as in Table 1.
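The sketch below illustrates one way to build such a combined matrix. Only the weights $w_1 = 0.6$ and $w_2 = 0.4$ come from the text; min-max standardization and a pairwise Kyte-Doolittle similarity as the hydrophobicity matrix are our assumptions, since the exact standardization and hydrophobicity matrix are not specified.

```python
# Combined scoring matrix f = w1 * standardized(BLOSUM62) + w2 * standardized(hydro).
import numpy as np
from Bio.Align import substitution_matrices

AA = "ACDEFGHIKLMNPQRSTVWY"
KD = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4,
      "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5,
      "P": -1.6, "Q": -3.5, "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2,
      "W": -0.9, "Y": -1.3}

def minmax(m):
    """Assumed standardization: rescale all elements to [0, 1]."""
    return (m - m.min()) / (m.max() - m.min())

blosum62 = substitution_matrices.load("BLOSUM62")
a = minmax(np.array([[blosum62[x, y] for y in AA] for x in AA]))
# Assumed hydrophobic similarity: closer Kyte-Doolittle values score higher
b = minmax(np.array([[-abs(KD[x] - KD[y]) for y in AA] for x in AA]))

w1, w2 = 0.6, 0.4
f = w1 * a + w2 * b   # comprehensive score matrix (cf. Table 1)
```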
Subsequently, we applied the comprehensive scoring matrix, combining the BLOSUM62 and hydrophobicity matrices with the Markov transition matrix, to screen and generate feasible amino acid substitutions at each position. After the initial 10 generated signal peptides gave unsatisfactory experimental results, we generated 60 signal peptides and scored them with the first RF regression model; the modified signal peptides that scored higher than the wild-type NFAA were chosen for experiments.
Because the Markov transition frequency matrix based on Gram-positive bacteria might not effectively generate the desired outcomes, and the Gram-positive RF regression model showed poor sensitivity (±0.2) for Gram-negative bacteria, we optimized the model using the small dataset we had generated.
At the end of the experiment, we fed the signal peptides that scored highly in the original model but performed poorly in experiments into our latest signal peptide prediction model; its predictions followed essentially the same trend as the experimental results.