Model



Abstract

Plasmids are essential vectors in synthetic biology, facilitating the delivery of genetic material in bacterial systems. However, the generation of novel plasmid sequences with targeted functionalities presents a significant challenge, and no experimentally validated models exist for reliably designing biologically viable, de novo plasmids. We present a novel plasmid generation pipeline that integrates machine learning with experimental validation to produce functional plasmid sequences. Utilizing a custom Byte Pair Encoding tokenizer and the Mamba2 state-space model architecture, our model efficiently handles the long and complex nature of plasmid DNA. The model was trained in two phases: pre-training on a comprehensive dataset of over 137,000 plasmid sequences, followed by fine-tuning on specific target sequences. We generated and validated plasmid sequences in silico using sequence alignment tools and motif discovery techniques to identify valid origins of replication (oris) and antimicrobial proteins. In vitro experiments demonstrated that several generated plasmids successfully replicated in bacterial systems, confirming the functionality of the synthesized oris. We also identified promising antimicrobial proteins that are currently being tested.

Background

Plasmids are central to a wide array of biological applications. Their versatility in bacterial systems makes them an ideal vehicle for genetic material delivery. However, while researchers can manipulate and modify existing plasmids to accomplish useful functions, the generation of completely novel synthetic plasmid sequences with specific, targeted functionalities remains an open challenge. Currently, no experimentally validated models exist that can reliably generate de novo plasmid sequences that are biologically viable and functionally relevant. While sequence modelling tools have advanced considerably in recent years, we still lack a robust model to account for complex plasmid architecture.

Existing machine learning models, such as HyenaDNA and Evo, have demonstrated the potential for modelling and predicting biological sequences [1,2]. This approach has been useful in identifying novel sequences with desired traits. However, these models often face limitations when tasked with producing plasmid sequences that need to fulfill highly specific biological roles or interface with complex cellular machinery, such as the origin of replication (ori) regions, protein-binding motifs, and promoter elements.

Another limitation of current models is the lack of experimental wet lab validation. Most sequence generation models rely heavily on in silico metrics such as alignment scores and predicted structural features. While these methods are useful for narrowing down candidate sequences from initial model outputs, they do not guarantee that the generated plasmids will perform as expected in wet lab experiments. Consequently, the gap between computationally predicted sequences and experimentally confirmed plasmids remains wide. This underscores the need for a comprehensive pipeline that integrates computational generation, in silico validation, and rigorous in vitro experimental testing to iteratively refine models and ensure that generated plasmids are biologically relevant and functional. Given these challenges, the development of a model that can reliably generate plasmid sequences with experimentally validated potential for targeted functions remains an unmet need in the sequence model space.

Objectives

To meet this need, we work towards the following objectives:

  • Curate comprehensive and diverse datasets of plasmid sequences and motifs that are crucial for plasmid functionality, including ori sequences, promoter regions, and protein-binding motifs.
  • Leverage these datasets to generate complete plasmid sequences using state-of-the-art machine learning models.
  • Validate these computationally generated plasmids through both in silico methods, such as sequence alignment and motif discovery, and in vitro wet lab testing. Specifically, identify promising oris and antimicrobial proteins. This dual approach aims to iteratively refine our generative models, ultimately enhancing their accuracy and applicability for synthetic biology research.

Methods

Dataset Curation

Our natural bacterial plasmid dataset consists of 137,041 circular plasmid sequences derived from single origins. The sequences were curated from public databases such as PLSDB and IMG/PR [3,4], and were filtered based on plasmid topology (e.g., circular vs linear) and source type.

Model Development

There are three main components of our workflow: sequence tokenization, model architecture and training, and model inference. This approach enables us to effectively process long plasmid sequences and generate novel, biologically plausible plasmids.

Sequence Representation and Tokenization

Most recent genomics modelling approaches adopt deep sequence models, which have seen success in natural language processing (NLP). Each nucleotide in a DNA sequence is considered as a token, analogous to a word in a sentence in the NLP context. This means that sequences can be extremely long. In our datasets, even the shorter plasmids can be 2-10K base pairs (bp) long. This presents a computational challenge for modern deep learning models, and some ingenuity is required to overcome this obstacle.

We adopt a tokenization strategy called SentencePiece [5] with Byte Pair Encoding (BPE) [6] trained on our dataset of plasmid sequences, inspired by the success of DNABERT-2 [7]. This tokenizer identifies and encodes frequently occurring subsequences, effectively compressing repetitive DNA patterns common in plasmids. After tokenizing a raw nucleotide sequence, each token represents a short k-mer (e.g., AAT) in the tokenizer’s dictionary. The use of BPE is crucial for managing long plasmids within computational constraints, allowing us to fit plasmid sequences up to 4x longer in the same context length.
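
To make the compression effect concrete, the following is a minimal pure-Python sketch of how BPE merges can be learned from nucleotide sequences. It is a toy illustration, not the SentencePiece implementation used for the model; the function names (learn_bpe, apply_merge, encode) and the tiny example corpus are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges, starting from single nucleotides."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def encode(sequence, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    toy_corpus = ["ATGATGATGCCC", "ATGATGTTT"]  # hypothetical toy data
    merges = learn_bpe(toy_corpus, num_merges=4)
    print(encode("ATGATGATG", merges))
```

Each merge turns a frequent pair of adjacent tokens into a single longer token, so repeated subsequences are encoded with fewer tokens than one-token-per-nucleotide encoding would need.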

In early experiments, we also found that nucleotide-level tokenization produced pathological models that tended to collapse towards long mononucleotide repeats during sampling. We note that this failure mode is well-known in NLP and one strategy for mitigating it is to apply a repetition penalty that discourages the model from repeating tokens. However, this seems ill-suited to the domain of genomics, where repetition is ubiquitous and important. Fortunately, this behaviour was not observed after switching to BPE tokenization.

Model Architecture and Training

For our model architecture, we use the Mamba2 model, a recent state-space model (SSM) architecture which has demonstrated success in long-sequence modelling tasks [8,9]. This choice was driven by Mamba's ability to handle long sequences efficiently, making it well-suited for plasmid data. The Mamba architecture employs a state space model with a selection mechanism, allowing more controlled propagation of information between layers. The computational complexity of the forward pass is linear in the sequence length, which is superior to the Transformer’s [10] quadratic complexity. These features allow it to overcome the limitations of traditional attention-based Transformers in handling extended sequences.
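
For reference, the linear-cost claim can be read off a generic discretized state-space recurrence. The form below is a simplified sketch; the notation is an assumption for illustration and is not taken from the Mamba2 implementation.

```latex
h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t, \qquad t = 1, \dots, L
```

Each step updates a fixed-size state $h_t$, so processing a sequence of length $L$ costs $O(L)$, whereas self-attention compares every pair of positions and costs $O(L^2)$. In a selective SSM, the parameters $\bar{A}_t$, $\bar{B}_t$ and $C_t$ depend on the input at step $t$, which is what lets the model decide what to keep in or drop from its state.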

Our training objective follows the next-token prediction paradigm, similar to language model training. In this approach, the model learns to predict the next token in the sequence given the preceding tokens. This method enables the model to learn the underlying patterns and structures of plasmid sequences, capturing both local and global features of plasmid organization.
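
Written out, the next-token objective for one tokenized plasmid $x_1, \dots, x_T$ is the standard autoregressive negative log-likelihood. The symbols here are introduced for illustration, and details such as per-token averaging are assumptions.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```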

We implemented a two-phase training process to optimize the model's performance. The first phase, pretraining, utilized our entire dataset of plasmid sequences. This broad exposure allows the model to learn general plasmid characteristics and patterns, establishing a strong foundation for sequence generation. The second phase, fine-tuning, focused on a subset of data specific to our target organism and plasmid class. This specialized training enhances the model's ability to generate plasmids relevant to our specific research focus, improving the quality and relevance of the generated sequences.

We also use genomics-specific data augmentation to improve training and enhance generation quality. We perform reverse-complement (RC) augmentation, wherein we randomly change the input plasmid sequence to its reverse-complement with probability 0.5. Also, since plasmids are circular, choosing a particular linearization of one is somewhat arbitrary. To this end, we propose a random circular crop augmentation, in which we randomly shift the chosen “starting point” by some number of nucleotides. The RC augmentation allows the model to capture the relationship between a nucleotide sequence and its reverse complement, and the circular crop augmentation provides the model with the ability to handle input plasmid sequences regardless of the chosen “starting point.” Both of these augmentation strategies result in improved generalization and support targeted generation of plasmid sequences with particular properties.
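
The two augmentations can be sketched in a few lines of Python. This is a minimal illustration operating on raw nucleotide strings before tokenization; the function names and the uniform choice of shift are assumptions, not details taken from our training code.

```python
import random

# Complement table; characters outside ACGT (e.g. N) are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def circular_shift(seq, offset):
    """Re-linearize a circular sequence so that it starts at `offset`."""
    if not seq:
        return seq
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def augment(seq, rng, rc_prob=0.5):
    """Apply RC augmentation with probability rc_prob, then a random shift."""
    if not seq:
        return seq
    if rng.random() < rc_prob:
        seq = reverse_complement(seq)
    return circular_shift(seq, rng.randrange(len(seq)))


if __name__ == "__main__":
    rng = random.Random(0)
    print(augment("ATGCCGTAAC", rng))
```

Both transformations leave the underlying circular, double-stranded molecule unchanged; only its linear representation differs.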

This training approach enables our model to first capture general plasmid structures and then specialize in generating plasmids relevant to our specific research focus, striking a balance between generalizability and specificity. This is the paradigm in many cutting-edge machine learning models such as AlphaFold [11]. The fine-tuning process allows for the targeted generation of plasmids with desired characteristics, which demonstrates our model’s potential for genomics applications.

Model Inference

During inference, we generate novel plasmid sequences as follows. We initiate sequence generation with a special "CLS" token, signaling the start of a new sequence. We then use nucleus sampling [12] to predict the next tokens within the sequence, generating many plasmid sequences in parallel. Nucleus sampling is the method of choice for many large language models; it also controls the diversity of generated samples, allowing us to trade off sample quality against diversity.

The generation process then proceeds iteratively. The model progressively generates each subsequent token, with decisions based on the preceding sequence and learned patterns. This process continues until a complete plasmid sequence is produced or a predefined length is reached, allowing for the creation of plasmids of varying sizes.
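
A minimal sketch of nucleus (top-p) sampling and the iterative generation loop is given below. The next_logits callable stands in for the trained model and is hypothetical, as are the integer token ids and the use of an explicit end token as the stopping signal.

```python
import math
import random


def nucleus_sample(logits, top_p, rng):
    """Sample an index from the smallest set of tokens with mass >= top_p."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Visit tokens from most to least probable until top_p mass is covered.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample within the nucleus, proportionally to the original probabilities.
    threshold = rng.random() * mass
    cumulative = 0.0
    for i in nucleus:
        cumulative += probs[i]
        if threshold < cumulative:
            return i
    return nucleus[-1]


def generate(next_logits, start_token, end_token, max_tokens, top_p, rng):
    """Generate tokens until end_token is drawn or max_tokens is reached."""
    tokens = [start_token]
    while len(tokens) < max_tokens:
        token = nucleus_sample(next_logits(tokens), top_p, rng)
        tokens.append(token)
        if token == end_token:
            break
    return tokens
```

A smaller top_p restricts sampling to the most probable tokens and gives more conservative sequences; a value closer to 1 admits more of the distribution and gives more diverse ones.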

Model In Silico Validation Pipeline

Overall Strategy

The validation pipeline assesses the reliability of a generated batch as a whole, identifies plasmid ORIs that appear promising, and selects antimicrobial proteins that could perform well experimentally.

Batch Summary Statistics

We performed a holistic assessment of the quality of each batch of sequences before starting a sequence-by-sequence validation pipeline. This was done at both the batch-wide level and the individual-sequence level. At the batch-wide level, we leveraged existing tools for genetic sequence clustering and annotation, particularly MOBsuite [13], to gain deeper insight into the potential biological significance of the changes between and within batches. We examined whether the sequences in a given batch formed clusters, aligned well against known motifs, and had circular contigs (which are good criteria for predicting putative novel plasmids).

Firstly, single-linkage clustering by MASH distance [14] was performed on all the generated sequences. This revealed relations that exist within a batch of sequences. Next, to shed light on the relation between a batch and known plasmid sequences, we used BLAST to calculate the alignment score between sequences in our batch and known sequences of replicons, relaxase proteins and repetitive elements. We also filtered by coverage and identity to ensure that only the sequences with the highest alignment scores were cataloged as putative plasmids. Finally, sequence size and GC content distribution were examined across our generated sequences. We would expect something close to a Gaussian distribution of GC content in a “good” batch.
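
The size and GC content summaries are straightforward to compute. The sketch below is a minimal illustration with hypothetical function names; it uses population standard deviations, which is an assumption about how spread was measured.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def batch_summary(sequences):
    """Summarize length and GC content across a batch of sequences."""
    lengths = [len(s) for s in sequences]
    gcs = [gc_content(s) for s in sequences]
    return {
        "n": len(sequences),
        "length_mean": mean(lengths),
        "length_sd": pstdev(lengths),
        "gc_mean": mean(gcs),
        "gc_sd": pstdev(gcs),
    }


if __name__ == "__main__":
    print(batch_summary(["ATGCATGC", "GGCCGGCC", "ATATATAT"]))
```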

Key Ori Features and Classes

Ori features and motifs. Oris require a complex and interdependent set of machinery to be present to replicate. Some of the most important regions or motifs are: Rep protein sites (Class A), AT-rich regions, DnaA boxes, dam methylation sequences, and iteron-containing regions (iterons are directly repeated sequences) [15]. The AT-rich region in particular is the most universally conserved structural element in both prokaryotic and eukaryotic replicons [15]. We compiled a database of major motifs found across the literature for these ori features.

Class A vs Class B plasmids – classification pipeline overview and results. Theta-replicating plasmids can be classified based on their replication initiation mechanisms. Of the many existing classes, this project mostly concerns Class A and Class B plasmids [16]. Class A theta plasmids rely on plasmid-encoded Replication proteins (Rep proteins) to initiate replication [16]. The replication of Class A plasmids is complex and rep proteins are not well documented. In contrast, Class B theta plasmids undergo a simpler replication mechanism. They do not require any Rep proteins and instead rely on host machinery for their replication initiation. Since only RNA II and a P2 promoter are required, Class B replication is easier to validate computationally as fewer sequence alignments can be used for identification.

Despite these known differences between Class A and Class B plasmids, there is no existing method in the literature that can classify an unknown plasmid’s sequence as Class A or Class B. As a result, despite the simplicity of Class B plasmid replication, it is still difficult to accurately classify unknown plasmids through only in silico methods.

We developed a preliminary script that aligns each generated plasmid sequence to a set of known Class B elements. The script sequentially aligned each generated plasmid sequence to the following four elements: P2 promoter, RNA II coding region, R-loop formation site (substituted by GC content analysis because R-loops are G/C-rich), and the terH consensus sequence. The alignment scores for all four components were negative for the majority of the plasmids. Only a very small subset of sequences showed positive scores for P2 promoter alignment, which corresponded to the shortened sequences in the datasets. These negative scores suggest poor alignment between the generated sequences and the Class B elements.
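
As an illustration of how an alignment score of this kind is computed, the sketch below implements a Needleman-Wunsch global alignment score with a linear gap penalty. The aligner and scoring parameters of the preliminary script are not stated here, so this scheme and its values are assumptions, chosen only because a global score, unlike a purely local one, can be negative.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    element = "TTGACA"  # toy reference element, not a real P2 promoter
    print(global_alignment_score(element, element))
    print(global_alignment_score(element, "GGGGGG" + element + "GGGGGG"))
```

Under a global scheme like this one, every unaligned position of the longer sequence incurs a gap penalty, so the score depends on the length difference as well as on similarity.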

These negative scores are likely due to a lack of sequence variability in the Class B elements that were used. For example, only one version of the P2 promoter and RNA II region (from the E. coli ColE1 plasmid) was used, limiting the accuracy of the alignments. Using a set of variants for each sequence could improve alignment scores. However, the lack of established consensus sequences for a P2 promoter and RNA II coding region makes this a difficult next step to pursue.

Ori Alignment Pipeline

The first step in our generated sequence alignment pipeline was to align our generated sequences to an ori database and select those generated sequences with a high but less than 100% alignment to a wildtype ori from the doriC [17] database. MMseqs2 [18] and BLAST+ [19] were used to align the sequences. These generated sequences were then passed into the annotation phase of our pipeline.
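
The selection of hits with high but imperfect identity can be sketched as a filter over BLAST+ tabular output (-outfmt 6, whose default columns begin with query id, subject id, percent identity and alignment length). The thresholds below are placeholders, not the values used in our pipeline, and the function name is hypothetical.

```python
def select_near_matches(blast_tsv, min_identity=80.0, min_length=100):
    """Keep hits with identity in [min_identity, 100) and enough length.

    Expects BLAST+ tabular text (-outfmt 6) with default columns:
    qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend,
    sstart, send, evalue, bitscore.
    """
    selected = []
    for line in blast_tsv.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        # Exclude exact copies of a wildtype ori as well as weak matches.
        if min_identity <= identity < 100.0 and length >= min_length:
            selected.append((query, subject, identity))
    return selected
```

Excluding 100% matches removes sequences that merely reproduce a known ori, while the lower bound keeps only candidates that remain recognizably ori-like.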

Ori Annotation

We compiled a database of major motifs found across literature for important ori features. To identify replication proteins corresponding to each ori, we first found motifs within the generated plasmid ori sequences and then determined if any replication proteins bind to these motifs. For motif discovery, we used MEME Suite 5.5.5's MEME tool [20]. To identify class B plasmids, sequences were both put through our class B alignment pipeline (discussed in “Key Ori Features and Classes”) and manually examined. Sequences that aligned to known class B plasmids were prioritized.

Protein Validation - Antimicrobial Resistance

Since our generative model was trained on whole-plasmid sequences containing a large collection of wild type protein sequences, the underlying assumption was that proteins generated from the learned distribution of functional proteins should also be functional. One significant challenge is the unannotated whole plasmid sequence on which the model was trained, meaning the model must be able to learn from a sample space not constrained to near known functional wildtype proteins, as well as not being provided with protein structural priors [21]. Below are some of the common concerns associated with generated protein sequences and the associated computational evaluation metrics we used to address these issues.

Typically, protein sequences are detected and validated in silico by comparing the distribution of generated sequences to natural controls using alignment-derived scores. We experimented with commonly used homology search techniques such as BLAST and MMseqs2 against the 2023 Comprehensive Antibiotic Resistance Database [22] to search for antimicrobial resistance (AMR) proteins. However, there were two problems with this approach. First, the homology alignment did not take into account the issue of overfitting commonly seen in generative models, in which the model produces sequences that mimic too closely those seen in the training set. Second, the relatively high threshold of 75% similarity would leave out potentially viable proteins that can attest to the model’s ability to produce novel sequences not previously seen in existing databases. Thus, it was clear that the traditional protein validation approach was not suitable for evaluating the performance of generative models.

To address this issue, we used alignment-based techniques with significantly more relaxed thresholds and on a more diverse family of proteins to identify potential sequences as testing samples. First, we applied a lower threshold of 20% coverage of the reference length, and extended the selected open reading frames (ORFs) by a 100 bp flanking sequence in the 5’ direction of the coding sequence. The 20% cutoff was chosen to benchmark our model against a current state-of-the-art protein language model, ProGen, as it is the lowest percentage sequence identity among ProGen’s generated proteins [21]. This lower threshold also allows for the inclusion of artificial sequences that diverge from natural sequences. Second, using NCBI’s AMRFinderPlus database [23], we ran BLAST against both the “core” and “plus” protein collections and filtered for genes specific to E. coli to ensure consistency with plasmid design. The “core” and “plus” protein collections include not only AMR proteins, but also stress response factors (biocide, metal, heat resistance), virulence factors, antigens and porins. Overall, these design choices contribute to a protein validation pipeline with significantly higher granularity.
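
The coverage threshold and the flank extension can be sketched as follows. This is a minimal illustration under stated assumptions: coordinates are 0-based and end-exclusive, ORFs lie on the forward strand, the flank is taken upstream of the ORF start without wrapping around the circular sequence, and the function names and hit tuple layout are hypothetical.

```python
def reference_coverage(aligned_length, reference_length):
    """Percentage of the reference sequence covered by the alignment."""
    return 100.0 * aligned_length / reference_length


def extend_upstream(sequence, start, end, flank=100):
    """Return sequence[start:end] plus up to `flank` bp upstream of start."""
    new_start = max(0, start - flank)  # clamp at the start of the sequence
    return sequence[new_start:end]


def select_orfs(sequence, hits, min_coverage=20.0, flank=100):
    """Keep ORFs whose hit covers >= min_coverage % of its reference.

    Each hit is (orf_start, orf_end, aligned_length, reference_length).
    """
    selected = []
    for start, end, aligned_length, reference_length in hits:
        if reference_coverage(aligned_length, reference_length) >= min_coverage:
            selected.append(extend_upstream(sequence, start, end, flank))
    return selected
```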

Because our model generates sequences token by token in the manner of a language model, the generated proteins are sensitive to deleterious single-nucleotide variations, which can lead to poor protein expression and activity in in vitro experimentation. These may include missense mutations that disrupt protein folding and stability, or nonsense mutations that introduce stop codons and hinder expression. Thus, it is important to use structure-based tools to detect these issues early and to avoid, as far as possible, passing unfit candidates to wet-lab validation. In particular, we use the Transformer-based protein structure prediction model ESMFold [24], most importantly for its ability to accurately predict structures for orphan proteins with limited sequence homologs. We used four metrics for assessing the accuracy and reliability of predicted protein models: Predicted Local Distance Difference Test (pLDDT), Predicted Aligned Error (PAE), Plot of Contacts from Structure Module, and the predicted template modeling score (pTM-score). Overall, regions with high pLDDT/pTM scores and low PAE values can be considered potentially viable for downstream applications. Additional information about these protein metrics is provided in the appendix.
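
Nonsense mutations of the kind described above can be caught with a simple sequence-level check before any structure prediction is run. The sketch below translates a coding sequence with the standard codon assignments and flags in-frame stop codons before the final codon; it is an illustrative pre-filter with hypothetical function names, not a component of the released pipeline.

```python
from itertools import product

# Standard codon assignments, enumerated in TCAG order (TTT, TTC, TTA, ...).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(_BASES, repeat=3), _AMINO)
}


def translate(dna):
    """Translate DNA in frame 0; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def has_premature_stop(dna):
    """True if a stop codon occurs before the last codon of the sequence."""
    return "*" in translate(dna)[:-1]


if __name__ == "__main__":
    print(translate("ATGGCCTAA"), has_premature_stop("ATGGCCTAA"))
```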

An end-to-end preliminary implementation of the protein validation pipeline with the outlined metrics is included in our released software package. Given a collection of generated DNA sequences, the pipeline performs high-throughput sequencing, protein annotation, and 3D structure prediction. Upon completion, the program returns the predicted protein’s DNA and amino acid sequences, associated scores and visualization for evaluation purposes.

Ori Batch Results

Three major batches of sequences were generated and validated. Other batches (such as attempts with diffusion models) were produced but did not show sufficient promise to make it to the validation pipeline.

Batch 1

The first batch of plasmids was generated using a Mamba model trained on a smaller dataset of fewer than 50,000 plasmids. This initial model used nucleotide-level tokenization, which limited its ability to capture longer-range dependencies in the plasmid sequences. While this approach provided a baseline, it had limitations in generating biologically plausible plasmids. The batch-wide analysis pipeline produced the following results:

The sequence size of this batch showed wide variation. Its GC content distribution is slightly skewed towards the lower end, and it has a relatively small standard deviation, indeed the smallest of all the batches. Furthermore, no putative plasmids were identified; hence, no clusters were produced. No close local alignment results were produced for relaxase proteins either. There were nonetheless some promising results: MOBsuite found sequences that closely aligned to specific clusters of Rep proteins.

This encouraged us to dive deeper into the batch and select top sequences for in vitro validation. An initial set of 6 oris, a mix of Class A and Class B, was selected. However, after attempting to design these plasmids in Wet Lab, Class A plasmids were deemed too difficult to test at this stage, as their machinery is more complex and difficult to detect. As a result of this Wet Lab - Dry Lab iteration, 5 promising Class B oris were instead delivered to Wet Lab. Finally, one promising protein was also selected. None of the oris in this batch worked.


Batch 2

For the second batch, we transitioned to the Mamba2 architecture and implemented a Byte Pair Encoding (BPE) tokenizer. This allowed us to scale the context length to encompass more nucleotides effectively. We also significantly expanded our training dataset. Additionally, we introduced data augmentation techniques, including random circular permutations and the inclusion of sub-plasmid sequences of varying lengths. These augmentations helped the model better capture the circular nature of plasmids and their modular structure.

The batch-wide analysis and statistics showed more promising results:

There was a very significant change in the distribution of size and GC content with respect to the previous batch. We observed simultaneously much smaller variance in the length of the generated sequences and a much higher variance in their GC content. Furthermore, this time MOBsuite did identify clusters of putative plasmids within the produced contigs. It also found close alignments to known clusters of rep and relaxase proteins (notably, most of the alignments came from the putative plasmid clusters).

These encouraging in silico results were matched by an equally encouraging in vitro validation: 8 Class B oris and 5 proteins were selected from the validation pipeline. At least 6 oris from this batch have shown promising results as of the time of writing.


Batch 3

For the third batch, we focused on exploring the model's capability to generate plasmids with specific, user-defined characteristics. As an initial experiment in this direction, we attempted to condition the model to produce plasmids with a target length of approximately 2048 bp. We also further refined our training dataset and fine-tuning stage using results of the wet lab experiments.

The initial statistical analysis of this batch strongly supports our conclusions about the (at least locally) optimal distribution of the physical properties among the generated sequences. The variance in sequence size was even smaller, with practically all sequences being around 2000 bp long. The GC distribution also showed a smaller variance, and the slight tri-modal pattern in Batch 2 disappeared almost completely.

We identified even more putative plasmid clusters than for Batch 2. Furthermore, the local alignment analysis identified many more matches with known relaxase and rep proteins. 9 oris were selected for experimental validation. These oris have not been fully tested in vitro yet.

Protein Selection

Most of the DNA sequences from Batch 3 that align with the AMR database are too short to encode a viable protein (fewer than 300 amino acids) when searched for ORFs and translated into amino acid sequences. Out of a batch of 10,000 sequences, 140 DNA sequences aligned with the AMRFinderPlus database above the minimum 20% coverage of the reference sequences, and only 6 translated amino acid sequences were predicted to form viable 3D protein structures.

One good indicator of model performance is the lack of overfitting. As seen in the percentage coverage of reference length, half of the selected predicted proteins cover only ~50% of the wild-type reference. If these proteins perform as well in wet lab experimentation as ESMFold predicts, then we can infer that the model is capable of producing viable proteins. Overall, 6 proteins were selected for testing.

The selected generated proteins below are listed in descending order of (1) predicted template modeling (pTM) score and (2) Predicted Local Distance Difference Test (pLDDT) score.

Seq5905_860-2045: tet(A) tetracycline efflux MFS transporter

Seq7980_197-1130: aph(6)-Id aminoglycoside O-phosphotransferase

Seq7293_1119-2046: aph(6)-Id aminoglycoside O-phosphotransferase

Seq5572_1-2046: sigA serine protease autotransporter toxin

Seq3558_1-2007: sigA serine protease autotransporter toxin

Seq1712: P25733, CfaC, CFA/I fimbrial subunit C

Most Valuable Plasmid

The most promising plasmid generated was Seq741, a 2026 bp plasmid with a 779 bp long ori sequence of Class B origin containing RNAI and RNAII molecules. The plasmid also contained a highly promising generated protein. The plasmid showed promising results when tested in vitro and the ori was functional (further discussed in Wet Lab results).

Discussion

The results of our plasmid generation model demonstrate both the potential and limitations of combining computational approaches with experimental validation in synthetic biology. Our tokenizer proved to be highly effective in improving our training metrics and retaining biological relevance. Our validation pipeline, particularly in ori alignment, allows for nuance by searching beyond high alignment metrics for the presence of important motifs. Nevertheless, the predominantly negative alignment scores for the Class B plasmid elements highlight the complexity of generating biologically functional plasmids solely through computational methods. These results suggest that while our model captures general plasmid structures, there are gaps in replicating the nuanced sequence variations required for specific functional elements. This underscores the need for more diverse reference sequences and improved fine-tuning of the model to capture subtle sequence motifs. The iterative wet lab-dry lab collaboration has proven invaluable in this regard; initial dry lab predictions identified promising sequences, which were then refined based on wet lab feedback. For instance, the decision to focus on Class B plasmids for wet lab testing was driven by their simpler replication machinery, which is easier to validate experimentally compared to Class A plasmids with more complex requirements. This collaborative cycle between computational generation and experimental validation not only enhances the accuracy of generated plasmids but also accelerates the overall engineering cycle. Some key limitations and next steps are discussed below.

Limitations

Limited Data For Model Training

The scarcity and complexity of plasmid sequence data present significant challenges for machine learning models like ours. Plasmids are typically long sequences, but relatively few have been fully sequenced, resulting in a limited dataset for training. Moreover, their modular nature, consisting of various functional elements combined in different ways, further complicates the learning process. To address these limitations, we implemented several data augmentation techniques. We employed random circular permutations to leverage the circular nature of plasmids and to increase sequence diversity. Additionally, we included sub-plasmid sequences of varying lengths, effectively breaking down the complex structures into more manageable segments. These strategies not only expanded our training dataset but also helped our model better capture the modular and circular characteristics of plasmids, improving its ability to generate biologically plausible sequences despite the initial data constraints.

Motif Annotation and Discovery

There is limited data available on relevant motifs. In particular, protein-binding motifs are difficult to annotate. MEME only discovers motifs; it does not determine whether any of the discovered motifs are known binding sites. We can use MEME Suite’s FIMO tool [25] to scan the sequences for occurrences of the discovered motifs and compare them against known binding site databases to identify potential binding sites; however, this requires supplying a set of known binding site motifs. So far, we have been unable to find a database of replication protein binding motifs. To overcome this limitation, in the current pipeline, we prioritize features that are more likely to be conserved, such as RNA II molecules and AT-rich regions.

Complex Ori Machinery

The ori is an intricate and interdependent network of smaller, essential components, all of which must be both present and functionally matched for successful plasmid replication. Because wildtype plasmids already possess these configurations, many of the intricacies involved in how these elements interact remain poorly studied. The challenge lies in the relational nature of these components: each element is not just independently necessary but must also coordinate with others in a specific spatial and temporal context. This complexity makes it difficult to experimentally disentangle the contributions of individual motifs, a difficulty further compounded by a lack of comprehensive datasets that characterize these interactions in diverse bacterial systems.

Relational Challenge

The relational challenge is especially problematic when attempting to generate or engineer new plasmid oris through computational models. The necessity for these components to be contextually correct, not just individually present, creates difficulties in ensuring functional replication in synthetic constructs. Despite these challenges, understanding and replicating these complex interdependencies is key to designing new plasmids with predictable behavior, and more research is needed to systematically explore how these components co-evolve and function together.

Next Steps

Building upon our current progress, our next step in plasmid generation will focus on developing a more context-aware model that can better capture the intricate relationships between different plasmid elements. We envision implementing a hierarchical generation approach, where the model first learns to generate a high-level plasmid architecture, including the relative positions and sizes of key functional elements such as ori regions, coding sequences, and regulatory elements. This skeletal structure would then guide the generation of specific sequences for each element, ensuring better coherence and biological plausibility. Additionally, we plan to incorporate attention mechanisms that allow the model to explicitly consider long-range dependencies within the plasmid sequence, potentially improving the generation of complex regulatory networks and compatible functional elements. This advanced generation strategy aims to produce more holistic and functionally viable plasmid designs, bridging the gap between computational predictions and biological reality.

Developing more sophisticated alignment and motif discovery algorithms, potentially incorporating machine learning-based sequence alignment methods, would enhance the accuracy of in silico validation, especially for distinguishing between complex plasmid types. Expanding the in silico validation pipeline to include more comprehensive functional assays, such as simulation of plasmid replication and expression dynamics, would provide deeper insights into the biological relevance of generated sequences. Eventually, we hope to extend validation from oris to entire plasmids and, beyond that, to other sequences such as bacterial chromosomes and phage genomes.

Conclusion

The development of our plasmid generation model highlights several critical advancements and challenges in the field of synthetic biology. Our use of BPE for tokenizing plasmid sequences has proven effective in compressing extensive genetic information while preserving biological relevance, enabling efficient processing of large plasmids. The Mamba2 model's state-space architecture further supports this by handling long-range dependencies more effectively than traditional Transformer models, thereby capturing both the local and global features essential for accurate plasmid generation. Our two-phase training process, pretraining on a broad dataset and then fine-tuning on specific plasmid classes, strikes a balance between generalizability and specificity, enhancing the model's ability to generate biologically plausible sequences tailored to specific research needs. However, challenges in the in silico validation process, particularly in the classification and alignment of Class A and Class B plasmids, reveal the need for more diverse training data and improved sequence alignment techniques. Preliminary results indicate that while Class B plasmid generation shows potential, further refinement of the motif discovery and alignment pipelines is required to increase the biological validity and experimental relevance of the generated plasmids. Our iterative approach, combining computational generation with wet lab validation, provides a strong foundation for generating novel, biologically relevant plasmids.

References

  1. Nguyen, E. et al. Sequence Modeling and Design from Molecular to Genome Scale with Evo. (Cold Spring Harbor Laboratory, 2024).
  2. Nguyen, E. et al. HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. arXiv preprint arXiv:2306.15794 (2023).
  3. Schmartz, G. P. et al. PLSDB: Advancing a comprehensive database of bacterial plasmids. Nucleic Acids Research 50, D273–D278 (2021).
  4. Camargo, A. P. et al. IMG/PR: A database of plasmids from genomes and metagenomes with rich annotations and metadata. Nucleic Acids Research 52, D164–D173 (2024).
  5. Kudo, T. & Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. in Proceedings of the 2018 conference on empirical methods in natural language processing: System demonstrations (eds. Blanco, E. & Lu, W.) 66–71 (Association for Computational Linguistics, Brussels, Belgium, 2018). doi:10.18653/v1/D18-2012.
  6. Sennrich, R., Haddow, B. & Birch, A. Neural machine translation of rare words with subword units. in Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers) (eds. Erk, K. & Smith, N. A.) 1715–1725 (Association for Computational Linguistics, Berlin, Germany, 2016). doi:10.18653/v1/P16-1162.
  7. Zhou, Z. et al. DNABERT-2: Efficient foundation model and benchmark for multi-species genomes. in The twelfth international conference on learning representations (2024).
  8. Dao, T. & Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. in Forty-first international conference on machine learning (2024).
  9. Gu, A. & Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. (2024).
  10. Vaswani, A. et al. Attention is all you need. in Advances in neural information processing systems (eds. Guyon, I. et al.) vol. 30 (Curran Associates, Inc., 2017).
  11. Nature Methods 20, 163–163 (2023).
  12. Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. in International conference on learning representations (2020).
  13. Robertson, J. & Nash, J. H. E. MOB-suite: Software tools for clustering, reconstruction and typing of plasmids from draft assemblies. Microbial Genomics 4, (2018).
  14. Ondov, B. D. et al. Mash: Fast genome and metagenome distance estimation using MinHash. Genome Biology 17, 1–14 (2016).
  15. Rajewska, M., Wegrzyn, K. & Konieczny, I. AT-rich region and repeated sequences – the essential elements of replication origins of bacterial replicons. FEMS Microbiology Reviews 36, 408–434 (2012).
  16. Lilly, J. & Camps, M. Mechanisms of theta plasmid replication. Microbiology Spectrum 3, (2015).
  17. Dong, M.-J., Luo, H. & Gao, F. DoriC 12.0: An updated database of replication origins in both complete and draft prokaryotic genomes. Nucleic Acids Research 51, D117–D120 (2022).
  18. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35, 1026–1028 (2017).
  19. Camacho, C. et al. BLAST+: Architecture and applications. BMC Bioinformatics 10, 1–9 (2009).
  20. Bailey, T. L., Johnson, J., Grant, C. E. & Noble, W. S. The MEME suite. Nucleic Acids Research 43, W39–W49 (2015).
  21. Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nature Biotechnology 41, 1099–1106 (2023).
  22. Alcock, B. P. et al. CARD 2023: Expanded curation, support for machine learning, and resistome prediction at the comprehensive antibiotic resistance database. Nucleic Acids Research 51, D690–D699 (2022).
  23. Feldgarden, M. et al. AMRFinderPlus and the reference gene catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Scientific Reports 11, 1–9 (2021).
  24. Lin, Z. et al. Evolutionary-Scale Prediction of Atomic Level Protein Structure with a Language Model. (Cold Spring Harbor Laboratory, 2022).
  25. Bailey, T. L., Johnson, J., Grant, C. E. & Noble, W. S. The MEME suite. Nucleic Acids Research 43, W39–W49 (2015).

Appendix



Training Details

The training architecture for the plasmid generation model leverages the Mamba2 state-space model (SSM) to handle the long-range dependencies in plasmid sequences. This section details the model structure, hyperparameters, and training strategy used to generate biologically relevant plasmids.

Model Architecture

Component | Description
--- | ---
Base Architecture | Mamba2 state-space model (SSM)
Number of Layers | 22
Hidden Dimension | 512
SSM Configuration | Layer: Mamba2; d_state: 64
Normalization Type | LayerNorm
Residual Computation | FP32
Fused Add Norm | Enabled
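
For readers who want to reproduce the setup, the architecture settings above can be collected into a single configuration object. The sketch below is illustrative only: the class name `PlasmidModelConfig` is hypothetical, and the field names loosely follow the naming used in publicly available Mamba configurations, which is an assumption rather than a description of our code. The vocabulary size is not listed in the table and is therefore left out.

```python
from dataclasses import dataclass, field


@dataclass
class PlasmidModelConfig:
    """Hypothetical container mirroring the architecture table (not the authors' code)."""

    n_layer: int = 22                 # Number of Layers
    d_model: int = 512                # Hidden Dimension
    # SSM Configuration: layer type and state dimension
    ssm_cfg: dict = field(
        default_factory=lambda: {"layer": "Mamba2", "d_state": 64}
    )
    rms_norm: bool = False            # table lists LayerNorm, so RMSNorm is assumed off
    residual_in_fp32: bool = True     # Residual Computation: FP32
    fused_add_norm: bool = True       # Fused Add Norm: Enabled


config = PlasmidModelConfig()
print(config)
```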

Hyperparameters

Hyperparameter | Value
--- | ---
Optimizer | AdamW
Betas | (0.9, 0.95)
Weight Decay | 0.1
Learning Rate Scheduler | 10% Linear Warmup, 90% Cosine Decay
Scheduler Span | 100,000 steps
Minimum Learning Rate | 1e-4
Maximum Learning Rate | 4e-3
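
The learning rate schedule described above can be written as a function of the training step. This is a minimal sketch under stated assumptions: the function name `learning_rate` is hypothetical, the warmup is assumed to start near zero (the starting value is not specified here), and the cosine phase is assumed to decay from the maximum to the minimum learning rate over the remaining 90% of the scheduler span.

```python
import math

MAX_LR = 4e-3
MIN_LR = 1e-4
TOTAL_STEPS = 100_000
WARMUP_STEPS = TOTAL_STEPS // 10  # 10% linear warmup


def learning_rate(step: int) -> float:
    """Return the learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup; starting near zero is an assumption.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from MAX_LR down to MIN_LR over the remaining steps.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, learning_rate(step))
```

Under these assumptions the rate peaks at the end of warmup and reaches the minimum value at the final step of the span.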

Hardware and Training Configuration

Parameter | Description
--- | ---
Nodes | 1
GPUs | 2 per node (NVIDIA A100, 40 GB)
Memory | 40 GB
Precision | Mixed (bf16)
CPUs | 12 (6 per task)
SLURM Configuration | Scheduler: SLURM; Time: 36 hours; Checkpointing: Enabled; Job Output: Logs directory

Sampling and Inference Strategy

The model uses a combination of advanced sampling techniques to control sequence quality:

  • Top-k Sampling: Retains the top k most probable tokens (disabled by default).
  • Top-p Sampling: Uses nucleus sampling, drawing from the smallest set of tokens whose cumulative probability reaches 0.9.
  • Repetition Penalty: Applies penalties to repeated tokens to reduce redundancy.
  • Temperature: Set to 1.0, which leaves the model's predicted distribution unscaled; lower values would sharpen it and higher values would increase randomness.
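
To make the interaction of these settings concrete, the following is a minimal, self-contained sketch of a single sampling step. It is not our inference code. The function name `sample_next_token`, the order in which the filters are applied, and the repetition penalty convention (dividing positive logits and multiplying negative logits by the penalty) are all assumptions chosen for illustration, and the penalty value shown in the example is arbitrary.

```python
import math
import random


def sample_next_token(logits, previous_tokens, rng, temperature=1.0,
                      top_k=0, top_p=0.9, repetition_penalty=1.0):
    """Sample one token id from raw logits (illustrative sketch only)."""
    scores = list(logits)

    # Repetition penalty (assumed convention): make already-generated tokens less likely.
    for tok in set(previous_tokens):
        if scores[tok] > 0:
            scores[tok] /= repetition_penalty
        else:
            scores[tok] *= repetition_penalty

    # Temperature scaling; 1.0 leaves the scores unchanged.
    scores = [s / temperature for s in scores]

    # Token ids ordered from most to least likely; top-k keeps only the first k.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Softmax over the remaining candidates (shifted by the peak for stability).
    peak = scores[order[0]]
    weights = [math.exp(scores[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus (top-p): keep the smallest prefix whose cumulative mass reaches top_p.
    kept_tokens, kept_probs, mass = [], [], 0.0
    for tok, p in zip(order, probs):
        kept_tokens.append(tok)
        kept_probs.append(p)
        mass += p
        if mass >= top_p:
            break

    # random.choices renormalizes the kept weights internally.
    return rng.choices(kept_tokens, weights=kept_probs, k=1)[0]


rng = random.Random(0)
token = sample_next_token([2.0, 1.0, 0.5, -1.0], previous_tokens=[0], rng=rng,
                          repetition_penalty=1.2)
print(token)
```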

Diffusion Model

We initially explored using a diffusion model to facilitate plasmid generation. Diffusion models are a class of deep generative models which operate by iteratively adding noise to input data over many steps, and then learning to denoise the corrupted data back into the original data [1,2]. In our context, this would allow the generation of a plasmid from random noise at inference time. In particular, we attempted an approach called discrete diffusion, a type of diffusion model designed to operate on discrete data [3,4]. However, we found that this type of modeling was not well-suited to our task and abandoned this line of work.
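
As an illustration of the noising half of this idea on discrete tokens, the sketch below corrupts a token sequence step by step using a uniform transition kernel, one of the simpler choices described in the discrete diffusion literature [3]. It is not the code we experimented with: the function name `corrupt`, the per-step corruption probability `beta`, and the toy vocabulary are hypothetical, and the learned denoising network is not shown.

```python
import random


def corrupt(tokens, vocab_size, num_steps, beta, rng):
    """Forward (noising) process: at each step, every token is replaced by a
    uniformly random token with probability beta. Returns the full trajectory,
    starting with an unmodified copy of the input."""
    noisy = list(tokens)
    trajectory = [list(noisy)]
    for _ in range(num_steps):
        noisy = [
            rng.randrange(vocab_size) if rng.random() < beta else tok
            for tok in noisy
        ]
        trajectory.append(list(noisy))
    return trajectory


rng = random.Random(0)
sequence = [0, 1, 2, 3, 0, 1, 2, 3]  # toy token ids, e.g. A, C, G, T
steps = corrupt(sequence, vocab_size=4, num_steps=5, beta=0.2, rng=rng)
for t, state in enumerate(steps):
    print(t, state)
```

After enough steps the sequence becomes indistinguishable from random tokens; a denoising model would be trained to reverse this trajectory.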

Protein Validation Metrics

Below we outline four metrics based on ESMFold output for assessing the accuracy and reliability of predicted protein models.

  • The Predicted Local Distance Difference Test (pLDDT) score is a per-residue confidence metric ranging from 0 to 100, reflecting the model's confidence in the positioning of each amino acid. Higher pLDDT scores indicate greater confidence; for instance, residues with scores above 90 are considered highly reliable and well-resolved, while those below 50 suggest low confidence and may correspond to disordered or flexible regions.
  • Predicted Aligned Error (PAE) provides insight into the expected positional error between pairs of residues in the predicted structure. It estimates the uncertainty in the relative positioning of different regions within the protein. For example, a low PAE between two domains suggests a confident prediction of their spatial relationship, whereas a high PAE indicates potential variability, signaling that these domains might adopt multiple conformations or are flexible relative to each other.
  • The plot of contacts from the structure module shows predicted interactions between amino acid residues based on their proximity in the three-dimensional structure. Analyzing these contacts helps validate key intra-protein interactions essential for structural integrity and function. For instance, if a predicted active site shows consistent contact patterns between catalytic residues, it reinforces confidence in the model's functional relevance. Conversely, missing expected contacts might suggest regions where the model is less reliable or where alternative conformations exist.
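  • The predicted template modeling score (pTM-score) is the model's estimate of the TM-score, a measure of global structural similarity, between the predicted structure and the true structure. A score of 1.0 indicates a perfect match. We treat scores above 0.7 as indicating that the structures share the same backbone structure, and scores above 0.9 as indicating structures similar enough to be treated as interchangeable for downstream use, although structural similarity alone does not establish equivalent function.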
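
The pLDDT thresholds above can be applied to a list of per-residue scores to summarize a predicted structure. This is a minimal sketch: the function name `summarize_plddt` and the example scores are hypothetical, and extracting the scores from ESMFold output is not shown.

```python
def summarize_plddt(plddt_scores):
    """Summarize per-residue pLDDT scores (0 to 100) using the 90 and 50 cutoffs."""
    n = len(plddt_scores)
    if n == 0:
        raise ValueError("plddt_scores must contain at least one residue")
    return {
        "mean": sum(plddt_scores) / n,
        # Fraction of residues considered highly reliable (pLDDT > 90).
        "fraction_high": sum(s > 90 for s in plddt_scores) / n,
        # Fraction of low-confidence residues (pLDDT < 50), possibly disordered.
        "fraction_low": sum(s < 50 for s in plddt_scores) / n,
    }


summary = summarize_plddt([95.0, 92.0, 70.0, 40.0])  # made-up example scores
print(summary)
```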

References

  1. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. in Proceedings of the 32nd international conference on machine learning (eds. Bach, F. & Blei, D.) vol. 37 2256–2265 (PMLR, Lille, France, 2015).
  2. Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. in Advances in neural information processing systems (eds. Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F. & Lin, H.) vol. 33 6840–6851 (Curran Associates, Inc., 2020).
  3. Austin, J., Johnson, D. D., Ho, J., Tarlow, D. & Berg, R. van den. Structured denoising diffusion models in discrete state-spaces. in Advances in neural information processing systems (eds. Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P. S. & Vaughan, J. W.) vol. 34 17981–17993 (Curran Associates, Inc., 2021).
  4. Hoogeboom, E. et al. Autoregressive diffusion models. in International conference on learning representations (2022).