Engineering Cycle for OriV Creation

The engineering cycle is a crucial aspect of our team’s journey towards the de novo synthesis of plasmid components. Multiple sub-teams worked in constant feedback with one another to design, build, and test components and workflows, and to learn from each project iteration.

Below, we detail the iterative process the dry lab and wet lab teams followed to create and validate novel oriVs, highlighting the feedback loops that together make up a thorough design-build-test-learn (DBTL) cycle.

Fig 1. DBTL cycle for oriV creation and validation, including feedback loops.

Generating and validating novel oriVs

Batch 1

Design.	The first batch of plasmids was designed using a Mamba model trained on a smaller dataset of fewer than 50,000 plasmids. This initial model used nucleotide-level tokenization. We also designed and used our preliminary validation pipeline, which did not distinguish between different classes of oriVs.
Build.	A simple assay was developed to test these plasmids. Using Golden Gate assembly, the generated oriVs were cloned into a plasmid carrying an ampicillin resistance gene.
Test.	The plasmids assembled with the generated sequences were transformed and plated. Other than the positive control, no colonies were observed, indicating that the Batch 1 sequences were non-functional.
Learn.	Nucleotide-level tokenization limited the model's ability to capture longer-range dependencies in the plasmid sequences, which resulted in lower-quality sequences. This was consistent with the fact that none of the sequences worked in the lab. We learned that a tokenizer that processes the training dataset more compactly could be highly beneficial. We also learned that Class B plasmids were much easier to test experimentally than Class A, which showed that we needed an improved validation pipeline that distinguishes between the two classes.
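To illustrate why a more compact tokenizer helps, the sketch below compares nucleotide-level tokenization with a toy byte-pair-style merge procedure on a short repeated sequence. It is not the team's tokenizer: the function names (`nucleotide_tokens`, `learn_bpe_merges`, `apply_merge`) and the example sequence are hypothetical, and the pair counting is deliberately simplified.

```python
from collections import Counter


def nucleotide_tokens(seq):
    """Nucleotide-level tokenization: one token per base."""
    return list(seq)


def apply_merge(tokens, left, right):
    """Replace every adjacent (left, right) pair with a single fused token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(seq, num_merges):
    """Toy BPE: repeatedly fuse the most frequent adjacent token pair.

    Simplification: overlapping pairs (e.g. in "AAA") are counted naively,
    which is good enough for an illustration.
    """
    tokens = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not compress
        merges.append((left, right))
        tokens = apply_merge(tokens, left, right)
    return merges, tokens


if __name__ == "__main__":
    seq = "ATGATGATGATG"  # hypothetical example sequence
    merges, tokens = learn_bpe_merges(seq, num_merges=2)
    print(len(nucleotide_tokens(seq)), "nucleotide-level tokens")
    print(len(tokens), "tokens after merging:", tokens)
```

With a fixed context length measured in tokens, fewer tokens per sequence means each window spans more nucleotides, which is the motivation given for moving away from nucleotide-level tokenization.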

Batch 2

Design.	For the second batch, we designed our model with the Mamba2 architecture and implemented a novel Byte Pair Encoding (BPE) tokenizer. This allowed each context window to encompass more nucleotides. We also significantly expanded our training dataset. Additionally, we introduced data augmentation techniques, including random circular permutations and the inclusion of sub-plasmid sequences of varying lengths. Based on what we learned in Batch 1, we developed a new strategy for classifying Class A vs Class B oriVs by pursuing further alignment against common Class B elements.
Build.	In the same fashion as Batch 1, the generated oriV sequences were assembled into a plasmid with the ampicillin resistance gene. No changes were made to the plasmid design this round.
Test.	After transformation into E. coli and plating, five samples (741, 4089, 5276, 5727, 9371) had functional oriVs, as validated by whole-plasmid sequencing. Interestingly, one additional sample, 6769, produced colonies, albeit very small ones. Sequencing failed to read this sample, so we cannot draw any conclusions about it. Review of the generated sequence showed that its ori lacked a promoter driving the critical RNAII component. The colonies could have formed because read-through transcription from the ampicillin resistance gene promoter drove the ori sequence.
Learn.	Five sequences produced successful results in the lab. This suggested that our new model design and augmentations helped the model better capture the circular nature of plasmids and their modular structure. By selecting a few specific Class B wild-type plasmids as our references, we successfully increased the throughput of the model. As for the wet lab validation, the plasmid was redesigned so that the oriV sequence is assembled in the opposite orientation to the ampicillin resistance gene. This ensures that expression of the oriV components cannot be attributed to a leaky terminator.
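The two augmentations named in the Batch 2 design, random circular permutations and sub-plasmid sequences of varying lengths, can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's training code: the function names, the length bounds, and the choice to let sub-sequences wrap around the origin are all hypothetical.

```python
import random


def circular_permutation(seq, rng):
    """Rotate a circular sequence so that it starts at a random position.

    A plasmid has no canonical start, so every rotation describes the
    same molecule.
    """
    if not seq:
        return seq
    start = rng.randrange(len(seq))
    return seq[start:] + seq[:start]


def random_subsequence(seq, min_len, max_len, rng):
    """Sample a contiguous window of random length from a circular sequence.

    Assumption: windows may wrap around the origin, so the sequence is
    doubled before slicing.
    """
    if not seq:
        return seq
    max_len = min(max_len, len(seq))
    min_len = min(min_len, max_len)
    length = rng.randint(min_len, max_len)
    start = rng.randrange(len(seq))
    return (seq + seq)[start:start + length]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is repeatable
    plasmid = "ATGCGTACCTGA"  # hypothetical toy sequence
    print(circular_permutation(plasmid, rng))
    print(random_subsequence(plasmid, 4, 8, rng))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which makes it easier to compare training runs.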

Batch 3

Design.	We focused on exploring the model's capability to generate plasmids with specific, user-defined characteristics. As an initial experiment in this direction, we attempted to condition the model to produce plasmids with a target length of approximately 2048 base pairs. We also expanded our validation pipeline to improve its protein identification capabilities. By relaxing our alignment threshold, we allowed more diverse families of proteins to be identified and reduced the chance of selecting only overfitted proteins.
Build.	This iteration of plasmids was designed so that the antibiotic resistance gene is assembled in the opposite orientation to the oriV sequence. This ensures that the promoter of the resistance gene cannot drive oriV expression.
Test.	The experiments are currently in progress.
Learn.	We saw reduced overfitting, particularly in proteins: half of the selected predicted proteins are only ~50% similar to wild-type proteins.
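The effect of relaxing an alignment threshold can be sketched with a simple percent-identity filter. This is an illustration only: the identity convention used here (matches over aligned, non-gap columns), the function names, the hit names, and the threshold values are all assumptions, since the pipeline's actual alignment tool and cut-offs are not stated above.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap ('-').

    Both inputs must come from the same pairwise alignment, so they must
    have equal length. Other identity conventions use a different denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def filter_hits(hits, min_identity):
    """Keep the names of hits whose identity meets the threshold."""
    return [name for name, identity in hits if identity >= min_identity]


if __name__ == "__main__":
    # Hypothetical (name, percent identity) pairs, not real results.
    hits = [("protein_a", 98.0), ("protein_b", 72.5), ("protein_c", 51.0)]
    print(filter_hits(hits, min_identity=90.0))  # strict threshold
    print(filter_hits(hits, min_identity=50.0))  # relaxed threshold
```

A strict threshold keeps only near-copies of known proteins, while a relaxed one also admits more divergent candidates, which is the trade-off described in the Batch 3 design.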