Motivation

In this year's project, as in previous years, we needed to design various genetic circuits, which required constructing multiple complex plasmids. Plasmid construction involves several experimental procedures such as PCR, assembly, and ligation, during which various unexplained errors can occur (e.g., smeared bands in PCR or inserts found in the wrong orientation after transformation).

This year, we developed software that uses machine learning to propose sequence changes that address these unexplained errors. With this software, we can design plasmids that are likely to be constructable from the start and build them without wasting effort or resources. We believe this will help not only iGEMers, who must conduct experiments in limited time, but also researchers worldwide who struggle with complex plasmid construction.

The source code of this software is uploaded to GitLab.

Software Overview

Highlights

A user-friendly web application with an easy-to-use UI that requires no Docker or similar tools (once it is made public)
Easy-to-use package
Extensive documentation
Sequence information can be input as strings, making it compatible with various file formats
A software tool useful for any iGEM project that involves plasmid construction

This page explains the technologies used in software development. This year, the Kyoto team developed software called "Muramasa" to assist in plasmid design. This software has two functions: 1) Quantifying the "complexity" of plasmid sequences, and 2) Identifying the causes of complexity.

In this development, we used anomaly detection, a type of machine learning. By utilizing existing plasmid sequence databases and learning from those sequences, we extracted features of "constructable" plasmids in two ways. By comparing this learned data with the desired plasmid data, we made it possible to predict the complexity of the target plasmid sequence and the sequences that would be hurdles in construction. We also implemented a function to suggest improvements to make the sequence closer to a "constructable" one.

Using this software tool, you can check whether your plasmid is likely to fail with just one button press when designing it.

Machine Learning and Anomaly Detection

What Is Anomaly Detection, and Why Use It?

In this software, we used a machine learning technique called anomaly detection. This is described as "identifying rare items that deviate significantly from the majority of the data and do not conform to a well-defined notion of normal data"[1].

[Figure: Concept of anomaly detection]

When quantifying plasmid complexity, there are many reports of "constructable" positive data, but extremely few reports of "difficult to construct" negative data. To resolve this imbalance in training data, we focused on anomaly detection, which can predict results by learning only from positive data.

Anomaly detection methods fall into four families: classification-based, probabilistic, reconstruction-based, and distance-based. Each family is further divided by whether it uses a neural network (NN). We focused on the reconstruction family and chose k-means as an algorithm that does not use an NN.

Because we do not use an NN, the detection process is not a black box: when an anomaly is detected, we can visualize which parts of the plasmid sequence are anomalous.
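The idea of learning only from "constructable" data and scoring a new plasmid by how far it lies from what was learned can be sketched as below. This is a minimal illustration, not the Muramasa implementation: the function names, the toy two-feature data, the number of clusters, and the use of the distance to the nearest cluster center as the anomaly score are all assumptions made for this example.

```python
import math
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                new_centers.append([sum(col) / len(group) for col in zip(*group)])
            else:
                new_centers.append(center)  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def anomaly_score(point, centers):
    """Distance from a point to its nearest learned cluster center (assumed score)."""
    return math.sqrt(min(squared_distance(point, c) for c in centers))


# Toy "constructable" training data: two tight groups of 2-feature vectors.
train = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]]
centers = fit_kmeans(train, k=2)

normal_like = anomaly_score([0.05, 0.05], centers)  # close to the training data
outlier = anomaly_score([3.0, -2.0], centers)       # far from every cluster
```

In practice the feature vectors would likely need to be scaled to comparable ranges before clustering, since raw sequence length and GC fraction differ by orders of magnitude.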

As features, we used GC content, sequence length, counts of various types of repetitive sequences, and flags indicating whether the sequence has high GC content or high AT content.
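A rough sketch of how features of this kind could be computed from a sequence string is shown below. The specific repeat measures (longest homopolymer run, repeated 8-mers), the 0.65 cutoffs for "high GC" and "high AT", and all function names are hypothetical choices for illustration; the text above does not specify the exact definitions Muramasa uses.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    longest = run = 0
    previous = None
    for base in seq.upper():
        run = run + 1 if base == previous else 1
        previous = base
        longest = max(longest, run)
    return longest


def count_repeated_kmers(seq, k=8):
    """Number of distinct k-mers that occur more than once (direct repeats)."""
    seq = seq.upper()
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return sum(1 for n in counts.values() if n > 1)


def extract_features(seq, high_gc=0.65, high_at=0.65):
    """Build a feature dictionary; thresholds are assumed, not from the source."""
    gc = gc_content(seq)
    return {
        "length": len(seq),
        "gc_content": gc,
        "longest_homopolymer": longest_homopolymer(seq),
        "repeated_kmers": count_repeated_kmers(seq),
        "is_high_gc": gc >= high_gc,
        # Assumes the sequence contains only A, C, G, T, so AT = 1 - GC.
        "is_high_at": (1.0 - gc) >= high_at,
    }


features = extract_features("ATGCGCGCGCGCAAAAAATTGCGCGCGC")
```

Because the input is a plain string, the same function works regardless of which file format the sequence was originally stored in.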

Training Data

As training data for "constructable" plasmid sequences, we used 5,837 plasmid sequences from the online plasmid repository Addgene[2].

Test Data

As test data for "constructable" plasmid sequences, we used 181 plasmid sequences from Addgene[2]. As test data for "difficult to construct" plasmid sequences, we used plasmid sequences that iGEM Kyoto actually found difficult to construct in 2021 and 2023.

Contribution Analysis

When an input plasmid is judged "anomalous", we calculate which parts of it contribute to the degree of anomaly.

To visualize this contribution, we use the distance between the data point and the cluster center, computed separately for each feature.

By displaying the part with the highest contribution, i.e., the most anomalous part of the sequence, we can help users improve their plasmids.
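The per-feature distance idea described above can be sketched as follows. This is an assumed formulation, not the packaged code: each feature's contribution is taken to be its share of the squared distance to the nearest cluster center, and the feature names, centers, and query values are made up for the example.

```python
def feature_contributions(point, centers, feature_names):
    """Share of the squared distance to the nearest center due to each feature."""
    def sq_dist(center):
        return sum((x - c) ** 2 for x, c in zip(point, center))

    nearest = min(centers, key=sq_dist)
    per_feature = [(x - c) ** 2 for x, c in zip(point, nearest)]
    total = sum(per_feature)
    if total == 0:
        # The point sits exactly on a center: nothing contributes.
        return {name: 0.0 for name in feature_names}
    return {name: d / total for name, d in zip(feature_names, per_feature)}


names = ["gc_content", "length", "repeats"]
centers = [[0.5, 0.0, 0.0], [0.4, 1.0, 0.5]]  # hypothetical scaled cluster centers
query = [0.5, 0.1, 3.0]                       # hypothetical scaled query plasmid

contrib = feature_contributions(query, centers, names)
top_feature = max(contrib, key=contrib.get)   # feature to report to the user
```

Normalizing to shares makes the contributions comparable across plasmids with very different overall anomaly scores.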

Results

Accuracy Evaluation

As an evaluation of the packaged model, we produced the ROC curve and AUC value for the test data.

The vertical axis of this ROC curve is the true positive rate, and the horizontal axis is the false positive rate. AUC takes values from 0 to 1, with values closer to 1 indicating higher model accuracy. Following [3], we interpret AUC values as follows: 0.5-0.6: Failed, 0.6-0.7: Worthless, 0.7-0.8: Poor, 0.8-0.9: Good, >0.9: Excellent.
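For readers unfamiliar with these quantities, the sketch below shows one way to compute ROC points and the AUC from labels and anomaly scores. The labels and scores are invented toy values, and the choice to mark "difficult to construct" plasmids with label 1 is an assumption of this example, not something stated for the actual evaluation.

```python
def roc_points(labels, scores):
    """ROC points (FPR, TPR) obtained by sweeping a threshold over the scores.

    labels: 1 marks the class the detector should flag (assumed here to be
            "difficult to construct"), 0 marks the other class.
    scores: higher means more anomalous.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: four plasmids of each class with made-up anomaly scores.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]

curve = roc_points(labels, scores)
area = auc(curve)
```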

As of October 2, the AUC of the packaged model was 0.77, which falls in the "Poor" range on the scale above. The packaged model therefore did not achieve high accuracy.

However, when we evaluated the same algorithm in our development environment, Google Colaboratory, before packaging, we obtained the result shown in the figure below.

In that environment, the model achieved AUC = 0.91634, which rates as Excellent. At the best threshold, the true positive rate is about 0.86 and the false positive rate is about 0.09, which are reliable figures.
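The text does not say how the "best threshold" was chosen. One common criterion is Youden's J statistic, which picks the threshold that maximizes TPR minus FPR; the sketch below illustrates that criterion on invented toy data and should be read as an assumption, not as the method actually used.

```python
def best_threshold_youden(labels, scores):
    """Return (J, threshold, TPR, FPR) for the threshold maximizing J = TPR - FPR.

    A sample is flagged when its score is >= threshold. On ties in J, the
    lowest such threshold is kept.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for threshold in sorted(set(scores)):
        tpr = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold) / positives
        fpr = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold) / negatives
        j = tpr - fpr
        if best is None or j > best[0]:
            best = (j, threshold, tpr, fpr)
    return best


# Toy data: label 1 is the class to flag; scores are made-up anomaly scores.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]

j, threshold, tpr, fpr = best_threshold_youden(labels, scores)
```

Other criteria, such as fixing a maximum acceptable false positive rate, would give a different operating point on the same curve.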

We suspect that this gap is caused by a mistake or a version mismatch introduced during packaging. Once we identify and resolve the cause, we expect the package to reach the same accuracy.

Contribution Analysis

When an anomaly is detected, its cause is indicated, as shown in the image. This feature allows users to easily understand which parts of their plasmid design need to be corrected.

References

[1] Chandola V, Banerjee A, Kumar V. Anomaly Detection: A Survey. ACM Computing Surveys. 2009;41(3):1-58. https://dl.acm.org/doi/10.1145/1541880.1541882

[2] Addgene: Homepage. Addgene.org. Published 2019. https://www.addgene.org/

[3] Polo TCF, Miot HA. Aplicações da curva ROC em estudos clínicos e experimentais. Jornal Vascular Brasileiro. 2020;19. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8218006/