Limnospira fusiformis is a difficult-to-transform cyanobacterium known for its nutritional content, which our team hoped to leverage to produce
ingredients for premium baby formula. Even as we pivoted to other strains such as UTEX 2973, UTEX 3154, and PCC 7942, transforming polyploid cyanobacteria
proved difficult and time-intensive. To make polyploid cyanobacteria genetically tractable, we needed to overcome their native Restriction Modification (RM) systems. To get past this bacterial immune system, we developed BLACKBIRD, which uses the output of the STEALTH algorithm to remove likely RM target sites from our gene
insert and adapts the insert to the target’s codon bias. Below we detail the Design-Build-Test-Learn (DBTL) process we followed to make the program viable
for engineering.
Design: The initial design of the BLACKBIRDCoOp algorithm applied the Stealth algorithm to overcome the hurdle of the Restriction
Modification System (RMS). Using the L. fusiformis genome, the program was to use Stealth to find all potential RMS cut sites and excise them from the provided
gene insert. We were provided the Stealth outputs for L. fusiformis and UTEX 2973, which we piped into our program. Because we were trained in Python through our coursework, we
implemented our solution in Python. The pipeline operated as follows:
| Step | Input | Output |
| --- | --- | --- |
| Insert translation | FASTA sequence | Python string |
| RMS site detection | Python string, List of Strings | List of List Indices |
| RMS site removal | Python string, List of List Indices, List of Strings | Python string |
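The first two steps of this pipeline can be sketched as follows. This is a minimal illustration, not the BLACKBIRDCoOp source: the function names `read_fasta` and `find_sites` are hypothetical, a single-record FASTA input is assumed, and the demo sequence and sites are made up.

```python
import io


def read_fasta(handle):
    """Return the sequence of a single-record FASTA stream as one string."""
    parts = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(">"):  # skip blank lines and the header
            continue
        parts.append(line.upper())
    return "".join(parts)


def find_sites(sequence, sites):
    """Return [start_index, site] for every occurrence, overlaps included."""
    hits = []
    for site in sites:
        start = sequence.find(site)
        while start != -1:
            hits.append([start, site])
            start = sequence.find(site, start + 1)
    return sorted(hits)


# Hypothetical demo input; a real run would open a FASTA file instead.
fasta = io.StringIO(">demo insert\nATGGAATTCAAA\nGGATCCTAA\n")
insert = read_fasta(fasta)
hits = find_sites(insert, ["GAATTC", "GGATCC"])
print(hits)
```

Each hit keeps both the position and the matched site, which mirrors the "List of List Indices" output that the removal step consumes.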
Build: The first iteration of the BLACKBIRDCoOp program laid the groundwork for the later iterations. To create an automated pipeline,
we implemented a FASTA file reader for the gene inserts provided to BLACKBIRDCoOp, as well as a reader for the Stealth results. After both were parsed,
the location and sequence of every Stealth match found in the gene insert were recorded. At the final stage of editing the gene insert,
we replaced the affected codon with a random alternate codon encoding the same amino acid. Because this approach generated multiple candidate sequences,
we chose which sequence to return by ranking the candidates on fewest Stealth hits and on percent identity.
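The random-codon approach described above can be sketched roughly as below. This is an illustrative reconstruction under assumptions, not the team's code: the insert is assumed to be in frame with a length that is a multiple of three, the standard genetic code is used, and all function names are hypothetical.

```python
import itertools
import random

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa
               for c, aa in zip(itertools.product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(seq):
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def count_sites(seq, sites):
    """Count every (possibly overlapping) occurrence of any site."""
    return sum(seq.startswith(s, i) for s in sites for i in range(len(seq)))


def random_variant(seq, sites, rng):
    """Swap the codon holding the first base of each hit for a random synonym."""
    for site in sites:
        pos = seq.find(site)
        while pos != -1:
            start = pos - pos % 3
            codon = seq[start:start + 3]
            swap = rng.choice(SYNONYMS[CODON_TO_AA[codon]])
            seq = seq[:start] + swap + seq[start + 3:]  # whole-codon swap keeps length
            pos = seq.find(site, pos + 1)
    return seq


def ranked_candidates(seq, sites, n, seed=0):
    """Generate n random variants and sort them by remaining site count."""
    rng = random.Random(seed)
    variants = [random_variant(seq, sites, rng) for _ in range(n)]
    return sorted((count_sites(v, sites), v) for v in variants)


insert = "ATGGAATTCAAAGGATCCTAA"  # hypothetical in-frame insert
ranked = ranked_candidates(insert, ["GAATTC", "GGATCC"], n=20)
```

Because a random synonym may leave the site intact, some candidates keep a nonzero score, which is consistent with the shortcoming reported in the Learn step of this cycle.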
Test: The first round of testing ran BLACKBIRDCoOp on the gene sequence of the EGFP protein, as it would be used as a screening marker
in our wet lab experiments. This testing cycle exposed several shortcomings of our initial prototype, which helped us refine our goals for future cycles.
Our implementation of RMS site removal generated alternate sequences for each Stealth site and saved them to be ranked later.
We realized this required a routine to identify the codon to be altered, which we implemented by finding the codon that contained
the first nucleotide of the cut site.
After an alternate sequence was generated with random codon matching, it was searched for remaining sites. The number of remaining sites was dubbed the 'Stealth Score',
and sequences were sorted from lowest score to highest. We additionally calculated the percent identity to the original gene insert sequence, to be
used further down the pipeline as another sorting method.
Figure: The output of BLACKBIRD, a list of sequences ranked by the number of RMS sites found in each. The count is shown at the end of each sequence as a negative number.
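The three measurements used in this test cycle (codon lookup, Stealth Score, and percent identity) can be sketched as below. The function names are hypothetical, and percent identity is assumed here to be a simple position-by-position comparison of equal-length sequences.

```python
def codon_start(site_start):
    """Index of the first base of the codon containing position site_start."""
    return site_start - site_start % 3


def stealth_score(sequence, sites):
    """Number of site occurrences remaining in the sequence (lower is better)."""
    return sum(sequence.startswith(s, i)
               for s in sites for i in range(len(sequence)))


def percent_identity(original, edited):
    """Percentage of positions at which two equal-length sequences agree."""
    if not original or len(original) != len(edited):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(original, edited))
    return 100.0 * matches / len(original)


original = "ATGGAATTCAAA"  # hypothetical insert containing GAATTC
edited = "ATGGAGTTCAAA"    # one synonymous change (GAA -> GAG)
print(stealth_score(original, ["GAATTC"]), stealth_score(edited, ["GAATTC"]))
print(round(percent_identity(original, edited), 1))
```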
Learn: We quickly recognized that random alternative codons were not a viable way to optimize inserts,
as too many Stealth sites remained in the sequence. We also found that the code output sequences of varying lengths, which we attributed to errors
in the random generation. We could not find a worthwhile way to use percent identity to the original sequence as a ranking method, so we
discarded it. The software development process also required functionality we had not foreseen in the Design period, such as the STEALTHParser module
to convert Stealth outputs from IUPAC notation and the codon finder for edits. Going forward, we put more effort into the design process to anticipate needs like these.
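As an illustration of the IUPAC conversion mentioned above, the sketch below expands a degenerate motif into every concrete sequence it represents. This is one possible approach, not necessarily how STEALTHParser does it; the function name `expand_iupac` is hypothetical and the Stealth output file format is not assumed here.

```python
import itertools

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_iupac(motif):
    """Return all plain A/C/G/T sequences matched by an IUPAC motif."""
    choices = [IUPAC[base] for base in motif.upper()]
    return ["".join(combo) for combo in itertools.product(*choices)]


print(expand_iupac("CCWGG"))  # W stands for A or T
```

Expansion grows quickly for motifs with several degenerate positions (each N multiplies the list by four), so a pattern-matching approach may be preferable for long, highly degenerate motifs.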
Design: During the second cycle, we added a second member to the dry lab team and had more time to design the process. We decided on the modules that would
make up the pipeline, which now included the parsing module. This iteration also introduced a codon bias calculator for the target genome, used to optimize
the gene insert for our target organism. Below is the updated pipeline:
| Step | Input | Output |
| --- | --- | --- |
| FASTAReader | FASTA file | Python String |
| STEALTHParser | TXT file | List of Strings |
| codonUsage | Python String | Dictionary of List of Lists |
| StealthHits | Python String, List of Strings | List of Lists |
| codonChoice | List of Lists, Python String, Dictionary of List of Lists | Python String |
Build: Several new Python modules were added to the BLACKBIRDCoOp build, and existing ones were renamed. The FASTA parser was named FASTAReader, and a STEALTHParser module
was introduced to take the Stealth sites as input and convert them from IUPAC nomenclature. The codonUsage module derived a codon usage table from the whole
genome file the program parsed. A StealthHits function searched the gene insert and found every instance of a Stealth hit, and finally,
the codonChoice module replaced the codons of each Stealth hit with the most biased codon from the codon usage table. This version did not require multiple sequences to be
stored and ranked. Concerns were raised about maintaining variation in future versions of the code, since a stochastic approach was expected to improve the transformation efficiency
of the sequence. We also began working in a GitHub repository for the dry lab team to use throughout. This was separate from the 2024 iGEM GitLab, which was
activated later in the process.
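A codon usage table of the "Dictionary of List of Lists" shape, and a most-biased-codon lookup, might look like the sketch below. It assumes the input is an in-frame coding sequence and uses the standard genetic code; how the real codonUsage module extracts coding regions from a whole genome is not described in the text, and the names here are hypothetical.

```python
import itertools
from collections import Counter

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa
               for c, aa in zip(itertools.product(BASES, repeat=3), AMINO)}


def codon_usage(coding_seq):
    """Map each amino acid to [[codon, count], ...], most used codon first."""
    counts = Counter(coding_seq[i:i + 3] for i in range(0, len(coding_seq) - 2, 3))
    table = {}
    for codon, aa in CODON_TO_AA.items():
        table.setdefault(aa, []).append([codon, counts[codon]])
    for ranking in table.values():
        ranking.sort(key=lambda pair: -pair[1])
    return table


def most_biased(table, codon):
    """Most used synonym for the amino acid that this codon encodes."""
    return table[CODON_TO_AA[codon]][0][0]


usage = codon_usage("GCTGCTGCCAAAAAG")  # hypothetical toy "genome"
print(usage["A"])
print(most_biased(usage, "GCC"))
```

Looking up the amino acid from the codon being replaced keeps every substitution synonymous, which is the property the Test step of this cycle found was not always holding.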
Test: Testing the new version of BLACKBIRDCoOp revealed improvements that carried into later versions and were built on in the following cycles.
The codonUsage module successfully generated the codon usage tables; however, we faced hurdles when referencing them for changes. The codonChoice module
occasionally struggled to generate alternate sequences: in particular, it often substituted Alanine in place of the correct amino acid. The output
was now a single sequence, but its Stealth score was still too high.
Figure: The output of codonUsage, which shows the codon usage and bias of the genome given to BLACKBIRD.
Learn: In this iteration we discovered that we needed a new methodology for editing the gene insert. The errors caused by the codonChoice
module were likely indicative of other issues in the gene insert modification process. We faced two new hurdles: the existence of rare codons and
the mismatch between the host genome from which the gene insert was derived and the target genome. A further problem was that the program could create new
RMS sites with every edit. We began work on a solution to the codon bias problem while we brainstormed how to avoid introducing Stealth sites.
Design: The goal of the third cycle of the BLACKBIRDCoOp design process was to rework the codon bias tables and the codon selection process, as well as
to add domestication protocols to fit the iGEM standard for parts. The pipeline was changed to run the codon bias calculations on both genomes, and the
codonChoice module was altered to fit this. Below is the pipeline:
| Step | Input | Output |
| --- | --- | --- |
| FASTAReader | FASTA file | Python String |
| STEALTHParser | TXT file | List of Strings |
| codonUsage | Python String | Dictionary of List of Lists |
| StealthHits | Python String, List of Strings | List of Lists |
| codonChoice | List of Lists, Python String, Dictionary of List of Lists | Python String |
Build: The main change in this build of the program was the integration of a codon bias ranking system into the codonUsage module.
For each amino acid, the module now ranked codons by usage in both genomes and matched each codon in the host genome to the codon of the same rank in the target genome.
This was meant to replicate the conditions of the host organism within the target organism: the gene insert from the host is adapted to the target’s codon biases
while accounting for rare codons. [2][3]
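The rank-matching idea can be sketched as follows, assuming each genome's ranking is available as a dictionary from amino acid to a list of codons ordered from most to least used. The function names and the example rankings are hypothetical.

```python
def rank_map(host_ranking, target_ranking):
    """Map each host codon to the target codon that holds the same usage rank."""
    mapping = {}
    for aa, host_codons in host_ranking.items():
        target_codons = target_ranking[aa]
        for rank, codon in enumerate(host_codons):
            # Clamp the rank in case the two lists differ in length.
            mapping[codon] = target_codons[min(rank, len(target_codons) - 1)]
    return mapping


def recode(seq, mapping):
    """Rewrite an in-frame sequence codon by codon using the rank mapping."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(mapping.get(codon, codon) for codon in codons)


# Hypothetical alanine rankings, most used codon first.
host = {"A": ["GCC", "GCT", "GCA", "GCG"]}
target = {"A": ["GCT", "GCA", "GCC", "GCG"]}
mapping = rank_map(host, target)
print(recode("GCCGCGGCA", mapping))
```

A codon that is rare in the host stays rare in the target under this mapping, rather than being replaced by the target's single most used codon.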
Additionally, we researched how best to meet the iGEM parts registry’s standards and the requirements of the Golden Gate experiments we would run. A simple way to
remove Type IIS enzyme sites, including the Golden Gate cut sites, was to add them manually to our Stealth output list. Because every sequence in that list
is removed from the insert, this approach eased our workload.
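Adding domestication sites to the Stealth list could be automated along these lines. This is a sketch under assumptions: BsaI (recognition site GGTCTC) is used purely as an example Type IIS enzyme, since the text does not name the sites that were added, and the function names are hypothetical. Both strands are covered by also adding each site's reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site):
    """Reverse complement of an A/C/G/T sequence."""
    return site.translate(COMPLEMENT)[::-1]


def with_domestication(stealth_sites, extra_sites):
    """Append extra sites (both orientations) to the list, skipping duplicates."""
    combined = list(stealth_sites)
    for site in extra_sites:
        for form in (site, reverse_complement(site)):
            if form not in combined:
                combined.append(form)
    return combined


# "GATC" stands in for a parsed Stealth site; GGTCTC is the BsaI site.
sites = with_domestication(["GATC"], ["GGTCTC"])
print(sites)
```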
Test: The alterations to the codonChoice module led to successful modifications of our chosen gene insert. We no longer saw the errors that had
mistranslated codons in the previous cycle. However, our alternative sequences still had Stealth scores too high to be fit for transformation.
Learn: Because the Stealth scores for our sequence remained high, we investigated further and determined that the main cause was new Stealth sites
generated at the locations of previous Stealth site alterations. While developing the code in this DBTL cycle, we had already been designing a mechanism for changing
Stealth sites while preventing the generation of new ones. We also reconsidered our solution for domestication and concluded that
it would eventually need a more robust implementation, as the current manual approach would not work in a packaged workflow.
Design: As we redesigned the editing process for the gene insert, we returned to the drawing board for methods of editing
the sequence. The existing approach of selecting a single optimal codon had resulted in a loss of variation within the gene insert. In a meeting with our PI, Dr. David L. Bernick,
we learned that variability in our insert would make it stronger overall, and that a stochastic process for developing gene inserts would lead to higher
transformation efficiency. We therefore pursued two design paths concurrently. The first was an editing window approach, which generates a
five-codon window at the start of each Stealth site so that newly generated sites are easier to catch. In earlier manual testing, we had found certain
Stealth sites that could not be removed with the available codons, so we created a forfeit mechanism in which the program moves to the next site if a Stealth site cannot
be removed from the current window. This functionality was added to the codonChoice module. The second path, meant to address the issue of variation,
was a weighted random assignment for the codon replacements.
| Step | Input | Output |
| --- | --- | --- |
| FASTAReader | FASTA file | Python String |
| STEALTHParser | TXT file | List of Strings |
| codonUsage | Python String | Dictionary of List of Lists |
| StealthHits | Python String, List of Strings | List of Lists |
| codonChoice | List of Lists, Python String, Dictionary of List of Lists | Python String |
Build: The first step of the new updates was to create the editing window that allows newly generated sites to be detected after each edit.
The window was set at five codons, as a Stealth site 4 to 6 nucleotides long would be caught by a search in the window. The program places the starting codon of
the Stealth site as the third codon in the window. Edits are applied to this codon, and the window is then searched for any newly generated sites. If a site
persists after using the codon with the matching rank, the next codon in the ranking is attempted, and so forth until every codon has been tried. If the site still persists,
the window shifts to the next codon and the cycle is attempted again. Lastly, if none of the edits removes the site, the program forfeits it and continues to the
next site, leaving the site as it was found.
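The editing window and forfeit mechanism described above might be sketched as below. This is a simplified illustration rather than the codonChoice implementation: it assumes an in-frame insert, sites of at most six nucleotides, and uses the standard codon table order as a stand-in for a usage-based ranking. All names are hypothetical.

```python
import itertools

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa
               for c, aa in zip(itertools.product(BASES, repeat=3), AMINO)}
RANKED = {}  # amino acid -> codons; table order stands in for a usage ranking
for codon, aa in CODON_TO_AA.items():
    RANKED.setdefault(aa, []).append(codon)


def count_sites(seq, sites):
    return sum(seq.startswith(s, i) for s in sites for i in range(len(seq)))


def remove_site(seq, site_start, site, sites):
    """Try each codon overlapping the site; return seq unchanged on forfeit."""
    first = site_start // 3
    last = (site_start + len(site) - 1) // 3
    for codon_index in range(first, last + 1):   # shift to the next codon
        lo = max(0, codon_index - 2) * 3         # edited codon is third of five
        hi = min(len(seq), (codon_index + 3) * 3)
        before = count_sites(seq[lo:hi], sites)
        start = 3 * codon_index
        current = seq[start:start + 3]
        if len(current) < 3:
            break
        for alt in RANKED[CODON_TO_AA[current]]:  # walk down the ranking
            trial = seq[:start] + alt + seq[start + 3:]
            site_gone = not trial.startswith(site, site_start)
            # Accept only if the window holds fewer sites than before the edit.
            if site_gone and count_sites(trial[lo:hi], sites) < before:
                return trial
    return seq  # forfeit: leave the site as found


def clean(seq, sites):
    """Single left-to-right pass over the insert."""
    for pos in range(len(seq)):
        hit = next((s for s in sites if seq.startswith(s, pos)), None)
        if hit is not None:
            seq = remove_site(seq, pos, hit, sites)
    return seq


print(clean("ATGGAATTCAAATAA", ["GAATTC"]))
```

In this sketch a site that falls entirely on a single-codon amino acid, such as ATG for methionine, has no synonymous alternative and is forfeited.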
A secondary development was the random weighting program. It would randomly assign a codon as before, but using a weighted random choice that converts the
genome’s codon bias into a representative weight for each codon. Each codon's usage frequency in the codon bias table would serve as
its selection weight. This workflow was set aside for later implementation because of disagreements over the calculations.
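One way to realize such a weighted choice is with `random.choices` from the Python standard library, as in the sketch below. Since this workflow was not finalized, the weighting shown (each codon's share of the total count for its amino acid) is only an assumption, and the counts and function names are hypothetical.

```python
import random
from collections import Counter


def usage_to_weights(usage):
    """Convert [[codon, count], ...] into fractions that sum to 1."""
    total = sum(count for _, count in usage)
    return [count / total for _, count in usage]


def weighted_codon(usage, rng):
    """Pick one codon with probability proportional to its usage."""
    codons = [codon for codon, _ in usage]
    return rng.choices(codons, weights=usage_to_weights(usage), k=1)[0]


alanine = [["GCT", 70], ["GCC", 20], ["GCG", 10]]  # hypothetical counts
rng = random.Random(1)  # seeded so that runs are repeatable
draws = Counter(weighted_codon(alanine, rng) for _ in range(2000))
print(draws.most_common())
```

Frequently used codons dominate the draws, while rarer codons still appear occasionally, which preserves some variation in the insert.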
Test: During testing, we ran into 'maximum recursion depth exceeded' errors, because the program was written recursively and kept calling itself
until it hit Python's recursion limit. This was partially remedied by the forfeit mechanism, which simply passes on to the next site, though
a more elegant solution should be devised. Even so, with this version of BLACKBIRDCoOp we managed to reduce the Stealth sites
to below 5% of their original number. Based on studies of editing RMS sites, a reduction of this size should increase transformation efficiency significantly.
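One possible alternative to recursion, offered here only as a sketch and not as the team's implementation, is a bounded loop that repeats an editing pass until the sequence stops changing. The names `run_until_stable` and `toy_pass` and the `max_passes` cap are hypothetical; `toy_pass` merely masks sites with N so the loop can be demonstrated.

```python
def run_until_stable(seq, sites, edit_pass, max_passes=50):
    """Apply edit_pass repeatedly until the sequence stops changing."""
    for _ in range(max_passes):
        edited = edit_pass(seq, sites)
        if edited == seq:  # nothing left to change, or every site forfeited
            return seq
        seq = edited
    return seq  # cap reached: return the best effort so far


def toy_pass(seq, sites):
    """Stand-in editor: masks the first remaining site with N characters."""
    for site in sites:
        if site in seq:
            return seq.replace(site, "N" * len(site), 1)
    return seq


result = run_until_stable("GAATTCAAAGAATTC", ["GAATTC"], toy_pass)
print(result)
```

Because the loop depth never grows, the recursion limit is not a concern, and the pass cap guarantees termination even if edits keep recreating sites.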
Learn: Having shown significant results in modifying gene inserts, BLACKBIRDCoOp was close to a finished product at the end of this cycle.
The results from the program were promising. To adapt it for use by the general public, we condensed it into an installable package for use on the command line.
We consulted with Reto Stamm and utilized his Python Packaging Index to turn our program into a usable package. We then updated the official iGEM GitLab,
which we had not yet activated. As we finished this version of BLACKBIRDCoOp, we were left with many possible avenues for further development.
Among the parts of the project we worked on, accessibility was considered first: compatibility with more file types would allow for more widespread
adoption of the program and encourage research on RMS-containing organisms. Furthermore, the variation in the results could be improved to provide a
model more accurate to the conditions within an organism. Finally, letting users define sites to remove and to avoid would make the program
more user-friendly.
We considered porting the program to the Julia programming language to take advantage of its faster processing and data handling; this port
is currently in development. Our PI also suggested using genetic algorithms to develop a more advanced version of the program.
BLACKBIRDCoOp is open-source and available for anyone to build and iterate upon.
The BLACKBIRD version 'Alpha' has achieved notable success in designing custom inserts. The program processes an insert
sequence and refines it through multiple iterations of STEALTH-site removal, aiming to reduce the number of recognition sites targeted by the target organism's native restriction enzymes.
BLACKBIRD is a versatile program that can be calibrated for different organisms using a complete strain-specific genome in FASTA
format and a STEALTH software output. The STEALTH [1] output file contains the k-mers that correspond to hypothetical RM recognition sites; these k-mers are removed
from the custom insert. During the initial development of BLACKBIRD, we used the genome and STEALTH sites generated for the well-documented
UTEX 2973 strain. Using these specifications, we customized our three potential inserts accordingly: GFP, CbAgo,
and Cas12a.
This process was then replicated for UTEX 3154, our target organism. Since the genome of UTEX 3154 has not been officially
sequenced, BLACKBIRD currently operates on the genome of the closely related strain PCC 11901. The outcomes of BLACKBIRD for
both UTEX 2973 and PCC 11901 are shown in the bar graph.
After a comparable number of iterations through the program, inserts customized to PCC 11901 still contain almost 30% of their
initial STEALTH hits, whereas inserts customized to UTEX 2973 retain closer to 5%. Within the current version of BLACKBIRD, this difference can be accounted for
by the number of STEALTH hits for UTEX 2973 and PCC 11901, which are 283 and 754 respectively.
The STEALTH list of k-mers is generated based on a user-determined cut-off value. We initially ran BLACKBIRD with a False
Discovery Rate score (or bootstrap score) of around 72 as our cut-off. After finding those results unreliable, we raised the bootstrap score to 86. Because
the higher cut-off is less lenient when generating the list of k-mers, it produces better BLACKBIRD results. The right graph in the figure shows
the results of moving this cut-off; in one example, we were able to reduce the number of hits in the PCC 11901-customized GFP from 41 to just 6. This was a
predictable result: at the higher bootstrap score, the number of hits was similar to that of UTEX 2973 for all inserts, because the
number of STEALTH hits for both conditions was around 280.
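For reference, the percentages in this section can be read as the share of initial hits that remain after editing. The sketch below assumes that definition, which the text does not state explicitly, and applies it to the GFP example above (41 hits reduced to 6); the function name is hypothetical.

```python
def percent_remaining(hits_before, hits_after):
    """Share of the initial STEALTH hits still present after editing."""
    if hits_before <= 0:
        raise ValueError("hits_before must be positive")
    return 100.0 * hits_after / hits_before


gfp = percent_remaining(41, 6)  # PCC 11901-customized GFP example from the text
print(round(gfp, 1))
```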
References
[1] S. Hu, S. Giacopazzi, R. Modlin, K. Karplus, D. L. Bernick, and K. M. Ottemann, “Altering under-represented DNA sequences elevates bacterial transformation efficiency,” mBio, vol. 14, no. 6, Oct. 2023, doi: https://doi.org/10.1128/mbio.02105-23.