Overview

Causal analysis offers a way out of the "explanatory crisis" in the statistical analysis of complex systems. We introduce causal graph model testing and discovery, multi-order breakpoint analysis, and multi-objective optimization, and combine them into a statistical analysis framework that quantifies causal effects and assists experimental design. The framework is provided in two forms: a Python library and a highly integrated web application.

Highlight

Figure 1 Dye market development timeline.

User-friendly UI, deployed on a global server and accessed from the browser
API that is easy to distribute, published on PyPI
Compatible with GEO and ArrayExpress formats
Validated by experimental work
Improved benchmark performance with ensemble learning
Detailed documentation, example notebooks and tutorial videos

Applied Topic

Screening of Point Mutation Sites

In the pigment synthesis pathway studied in our wet lab, we identified a key enzyme, vioE, as the limiting factor that determines yield. In current computational studies of rational protein design, integrating various molecular properties and shaping the energy landscape to extract precise features is a significant challenge. Traditional multi-objective optimization methods are easily affected by confounding variables, which makes the optimization process hard to interpret and makes it difficult to extract principles from successful designs[1].

To address this issue, we developed a Bayesian optimization method that eliminates confounding factors. Within the regions of maximum predictive expectation, it successfully identified a mutant (BBa_K5319716) with enzyme activity at 132% of the wild type. Detailed experimental design and data can be found here. The software workflow is as follows:

1. Graph build. We constructed a causal relationship graph for enzyme activity contributions using heterogeneous ensemble learning, based on enzyme activity data obtained from a dual-random point mutation library (including but not limited to molecular properties and energy landscape data).

Figure 1. Causal graph model, which indicates that Pocket Volume and Length significantly affect Enzyme Activity, whereas the causal effects of Bottleneck Radius and Binding Energy on Enzyme Activity are not as significant.

2. Confounding factor exclusion. Build a baseline by rearranging the graph relationships, and perform LMC and TPa tests on the graph model constructed in the previous step.

Figure 2. Number of permuted DAGs versus fraction of violations. The dashed lines represent the fraction of violations of the given DAG, and the bars represent the number of randomly permuted DAGs with a certain fraction of violations. The blue and orange dashed lines appear to the left of most of the corresponding bars, which indicates that the given DAG is informative.

3. Multi-objective optimization. Based on the model that passes the tests in step 2, the data are recombined, and multiple rounds of one-step Bayesian sampling are performed to generate the expectation hotspot map.

Figure 3. Black-box function, mean, variance and expectation.
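The permutation baseline used in step 2 can be sketched as follows. This is a minimal illustration and not the expAscribe implementation: the function names, the node and edge lists, and the toy `toy_violation_fraction` scorer are all hypothetical. In practice the scorer would run the LMC and TPa tests on the measured data; here it is replaced by a simple stand-in so that the example stays self-contained.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

Edge = Tuple[str, str]


def permute_nodes(edges: Set[Edge], nodes: Sequence[str], rng: random.Random) -> Set[Edge]:
    """Relabel the nodes at random while keeping the graph structure."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    relabel = dict(zip(nodes, shuffled))
    return {(relabel[u], relabel[v]) for u, v in edges}


def permutation_baseline(
    edges: Set[Edge],
    nodes: Sequence[str],
    violation_fraction: Callable[[Set[Edge]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> Tuple[float, List[float], float]:
    """Compare the given DAG with randomly node-permuted DAGs.

    Returns the violation fraction of the given DAG, the baseline
    fractions, and the share of permuted DAGs that do at least as well.
    """
    rng = random.Random(seed)
    given = violation_fraction(set(edges))
    baseline = [
        violation_fraction(permute_nodes(edges, nodes, rng))
        for _ in range(n_permutations)
    ]
    p_value = sum(score <= given for score in baseline) / n_permutations
    return given, baseline, p_value


# Hypothetical toy graph; node names loosely follow Figure 1.
nodes = ["pocket_volume", "length", "bottleneck_radius", "activity"]
edges = {("pocket_volume", "activity"), ("length", "activity")}

# Assumption: pairs that a statistical test found dependent in the data.
dependent_pairs = {frozenset(edge) for edge in edges}


def toy_violation_fraction(dag: Set[Edge]) -> float:
    """Stand-in scorer: share of edges not supported by a dependence."""
    unsupported = [e for e in dag if frozenset(e) not in dependent_pairs]
    return len(unsupported) / len(dag)


given, baseline, p_value = permutation_baseline(edges, nodes, toy_violation_fraction)
print(f"given DAG: {given:.2f}, share of permuted DAGs at least as good: {p_value:.2f}")
```

A small share of permuted DAGs scoring as well as the given DAG corresponds to the dashed lines lying to the left of most bars in Figure 2.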

Based on this model, we were able to screen for samples with the greatest potential for enzyme activity enhancement from a larger virtual mutant library. The accuracy of our software was validated by experimental results, which identified a mutant that increased enzyme activity by 32% within the predicted region (pocket size of 600 ± 100 or 1200 ± 100, with a channel length of 6 ± 2).
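To make the idea of ranking virtual mutants by expectation concrete, the sketch below computes the standard expected-improvement acquisition value from a surrogate model's predicted mean and standard deviation. It is only an assumption that the "expectation" in Figure 3 corresponds to this quantity; the function name, the candidate names and all numbers are hypothetical and are not taken from our data.

```python
import math
from typing import Dict, Tuple


def expected_improvement(mean: float, std: float, best: float, xi: float = 0.0) -> float:
    """Expected improvement over `best` for a maximisation problem.

    mean, std: surrogate prediction for one candidate.
    best: best value observed so far.
    xi: optional exploration margin.
    """
    improvement = mean - best - xi
    if std <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + std * pdf


# Hypothetical predictions (mean, std) of activity relative to wild type = 1.0.
candidates: Dict[str, Tuple[float, float]] = {
    "mutant_A": (1.10, 0.05),
    "mutant_B": (1.05, 0.30),
    "mutant_C": (0.90, 0.02),
}
best_observed = 1.00

scores = {
    name: expected_improvement(mean, std, best_observed)
    for name, (mean, std) in candidates.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy case the uncertain candidate `mutant_B` ranks above `mutant_A` even though its predicted mean is lower, which is the exploration behaviour that makes such acquisition functions useful for screening a large virtual library.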

Benchmarking

Single-cell causal protein-signaling network discovery

The causal discovery algorithm was evaluated on the classic Sachs protein dataset [2]. The data consist of simultaneous measurements of 11 phosphorylated proteins and phospholipids derived from thousands of individual primary immune system cells, subjected to both general and specific molecular interventions.

Method	Accuracy (%)	Recall (%)	Path Coverage (%)
PC	33.05	52.63	78.98
GES	61.98	47.37	84.97
DAG-GNN	57.85	36.84	36.84
NOTEARS-MLP	50.41	10.52	42.11
expAscribe	71.07	42.01	100.00

Table 1. Evaluation metrics of expAscribe and the baseline algorithms on the Sachs dataset

expAscribe exceeds the baselines in Accuracy and Path Coverage, while its Recall is lower than that of PC and GES. In particular, our algorithm recovers 100% of the causal pathways in the Sachs consensus causal network. For more details, see Case Study, Case 2 in the technical manual.

Path Coverage is evaluated by:

$$ Score(G_{true},G_{pred}) = \frac {\sum_{i=1}^{|V|^2} \mathbb{I}_i}{|E(G_{true})|}, $$

where $ \mathbb{I}_i = 1 $ if and only if the $ i $-th of the $ |V|^2 $ ordered vertex pairs is an edge in $ G_{true} $ that also exists in $ G_{pred} $, and $ \mathbb{I}_i = 0 $ otherwise.
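The score above can be sketched in a few lines of Python. This is an illustrative implementation, not the one shipped in expAscribe; the function names and the toy edge lists are hypothetical. The `allow_indirect` option is an assumption on our part: it counts a true edge as covered when the predicted graph connects the two nodes through a directed path, which is one possible reading of "Path Coverage".

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def _reachable(adjacency: Dict[str, Set[str]], source: str, target: str) -> bool:
    """Depth-first search for a directed path from source to target."""
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        for successor in adjacency.get(node, ()):
            if successor == target:
                return True
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return False


def path_coverage(
    true_edges: Iterable[Edge],
    pred_edges: Iterable[Edge],
    allow_indirect: bool = False,
) -> float:
    """Fraction of edges of the true graph that the predicted graph covers."""
    true_set, pred_set = set(true_edges), set(pred_edges)
    if not true_set:
        raise ValueError("the true graph has no edges")
    adjacency: Dict[str, Set[str]] = {}
    for u, v in pred_set:
        adjacency.setdefault(u, set()).add(v)
    covered = sum(
        1
        for u, v in true_set
        if (u, v) in pred_set or (allow_indirect and _reachable(adjacency, u, v))
    )
    return covered / len(true_set)


# Toy edge lists for illustration only (not the full Sachs network).
g_true = [("PIP3", "PIP2"), ("PKA", "p38"), ("PKA", "Erk")]
g_pred = [("PIP3", "PIP2"), ("PKA", "p38"), ("PKA", "Mek"), ("Mek", "Erk")]

strict = path_coverage(g_true, g_pred)
relaxed = path_coverage(g_true, g_pred, allow_indirect=True)
print(strict, relaxed)
```

With the strict definition the toy prediction covers two of three true edges; with indirect paths allowed, PKA to Erk is also counted because it is reached through Mek.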

Figure 4. Our predicted graph alongside the classic signaling network and points of intervention of the Sachs dataset[2]. Comparing the two graphs shows that the correctly predicted relationships include, but are not limited to, those between PIP3 and PIP2 and between PKA and p38, and that some direct connections are represented by indirect connections.

Theory

The theoretical basis, mathematical derivation and technical details of expAscribe can be obtained here

DevTool Chain

The Python library is packaged and published with twine, which provides build-system-independent uploads of source and binary distribution artifacts for both new and existing projects.

The web app is built on Streamlit. We embed HTML, CSS and JavaScript in the framework for custom builds. Streamlit's technical features, especially session state and caching, allow for a more orderly workflow and a more robust data flow in our data science application.

Figure 5. Streamlit advanced property

We containerize the web app with Docker, push the image to Azure Container Registry, and deploy it on an Azure web service (SKU B3) located in West Europe, with data stored in Azure StorageV2.

Database Compatibility

The Python library expAscribe aims to broaden the range of data it can process, accepting data from mainstream databases such as GEO and ArrayExpress in addition to data from traditional sources. The expAscribe IO module is designed to be compatible with the GEO "Series Matrix File(s)" format and the ArrayExpress "E-MTAB" file format.

Figure 6. Supported GEO Series Matrix File(s)

Figure 7. Supported ArrayExpress E-MTAB File(s)
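For readers unfamiliar with the GEO Series Matrix layout, the sketch below shows how such a file can be read with the Python standard library only. It is not the expAscribe IO module and does not reflect its API: the function name `parse_series_matrix`, the sample accessions and the values are hypothetical. It relies only on the documented layout, in which metadata lines start with `!` and the expression table sits between `!series_matrix_table_begin` and `!series_matrix_table_end`.

```python
import csv
from typing import Dict, List, Tuple


def parse_series_matrix(text: str) -> Tuple[Dict[str, List[str]], List[str], Dict[str, List[float]]]:
    """Split a Series Matrix file into metadata, sample IDs and expression rows."""
    metadata: Dict[str, List[str]] = {}
    table_lines: List[str] = []
    in_table = False
    for line in text.splitlines():
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            in_table = False
        elif in_table:
            table_lines.append(line)
        elif line.startswith("!"):
            # Metadata line: key followed by tab-separated, quoted values.
            fields = next(csv.reader([line], delimiter="\t"))
            metadata.setdefault(fields[0].lstrip("!"), []).extend(fields[1:])

    rows = [row for row in csv.reader(table_lines, delimiter="\t") if row]
    if not rows:
        raise ValueError("no expression table found")
    header, body = rows[0], rows[1:]
    samples = header[1:]
    # Empty cells become NaN; other non-numeric cells raise ValueError.
    expression = {
        row[0]: [float(cell) if cell else float("nan") for cell in row[1:]]
        for row in body
    }
    return metadata, samples, expression


# Hypothetical miniature file for illustration.
sample_text = "\n".join([
    '!Series_title\t"Toy series"',
    '!Sample_title\t"control"\t"treated"',
    "!series_matrix_table_begin",
    '"ID_REF"\t"GSM0000001"\t"GSM0000002"',
    '"probe_1"\t1.5\t2.5',
    '"probe_2"\t0.1\t',
    "!series_matrix_table_end",
])

metadata, samples, expression = parse_series_matrix(sample_text)
print(samples, expression["probe_1"])
```

A strict parser of this kind fails on files whose table cells or annotations deviate from the standard layout, which is why non-standard files need manual cleaning first.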

Current limitations:

1. The IO module cannot parse some data files whose annotations are not sufficiently standardized.

2. Parsing is currently supported only in the Python library.

Resource

Webapp

access

Figure 8. A view of the Webapp UI

Source Code (GitLab)

access

Installation

Our Python library is published on PyPI.

pip install expAscribe

The web app source code is available in our GitLab repository. Build and run the app locally with:

git clone https://gitlab.igem.org/2024/software-tools/zju-china
cd zju-china/web
docker build -t <image-name> .
docker run -p 8501:8501 <image-name>

Documentation

API manual

access

Figure 9. A view of the Docs page

Webapp Tutorial Video

Team Theme-related

To echo the team theme, "NeovioDye", we also built an app prototype for user-defined pattern drawing. With hardware support for optically controlled projection, user-defined fashion printing and dyeing patterns were realized.

Figure 10. Workflow of Unidye

In our vision, user-defined patterns should not be replicable. At present, we achieve this through public key cryptography[3] and picture steganography, an approach that is classical and rather rigid. In the future, a more innovative and interesting protocol could be built on blockchain NFT (Non-Fungible Token) technology.
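As an illustration of the steganography half of this scheme, the sketch below hides a short byte string in the least significant bits of raw pixel values and reads it back. It is a generic least-significant-bit example, not the protocol used in our prototype: the function names, the 4-byte length prefix and the payload are assumptions, and the public-key signing step that would produce the hidden payload is omitted.

```python
def embed(pixels: bytes, message: bytes) -> bytes:
    """Hide `message` in the lowest bit of each pixel byte.

    A 4-byte big-endian length prefix is stored first so that the
    message can be recovered without knowing its size in advance.
    """
    payload = len(message).to_bytes(4, "big") + message
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover image too small for message")
    stego = bytearray(pixels)
    for index, bit in enumerate(bits):
        stego[index] = (stego[index] & 0xFE) | bit
    return bytes(stego)


def extract(pixels: bytes) -> bytes:
    """Recover a message written by `embed`."""

    def read_bytes(start_bit: int, n_bytes: int) -> bytes:
        out = bytearray()
        for b in range(n_bytes):
            value = 0
            for k in range(8):
                value = (value << 1) | (pixels[start_bit + 8 * b + k] & 1)
            out.append(value)
        return bytes(out)

    length = int.from_bytes(read_bytes(0, 4), "big")
    return read_bytes(32, length)


# Hypothetical cover data standing in for the raw pixel bytes of a pattern.
cover = bytes(range(256))
secret = b"owner:42"  # placeholder for a signature over the pattern

stego = embed(cover, secret)
recovered = extract(stego)
print(recovered)
```

Each pixel byte changes by at most one intensity level, so the pattern looks unchanged while still carrying the owner's mark.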

Interactive interface: From left to right, it is the drawing interface, the community photo wall, and the photo unriddle interface

Figure 8. A view of the Webapp UI

Reference

[1] Kortemme T. De novo protein design—From new structures to programmable functions[J]. Cell, 2024, 187(3): 526-544.

[2] Sachs K, et al. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data[J]. Science, 2005, 308: 523-529. DOI: 10.1126/science.1105809

[3] Rivest R L, Shamir A, Adleman L. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems[J]. Communications of the ACM, 1978, 21(2): 120-126. https://doi.org/10.1145/359340.359342

A more detailed list of references is given in the Technical Manual.