Overview
Applying causal analysis methods to statistical analysis offers a way out of the "explanatory crisis" in complex systems. We introduce causal graph model testing and discovery, multi-order breakpoint analysis, and multi-objective optimization, and combine them into a statistical analysis framework that quantifies causal effects and assists experimental design. The framework is provided in two forms: a Python library and a highly integrated web application.
Highlight
- User-friendly UI, deployed on a global server and accessed from the browser
- API that is easy to distribute, published on PyPI
- Compatible with GEO and ArrayExpress formats
- Validated by experimental work
- Improved benchmark performance through ensemble learning
- Detailed documentation, example notebooks and tutorial videos
Applied Topic
Screening of Point Mutation Sites
In the pigment synthesis pathway studied by our wet lab, we identified a key enzyme, vioE, as the limiting factor that determines yield. In current computational studies of rational protein design, integrating diverse molecular properties and shaping the energy landscape to extract precise features remains a significant challenge. Traditional multi-objective optimization methods are easily affected by confounding variables, which makes the optimization process hard to interpret and makes it difficult to extract principles from successful designs[1].
To address this issue, we developed a Bayesian optimization method that eliminates confounding factors and successfully identified a mutant (BBa_K5319716) with enzyme activity at 132% of the wild type in the region of maximum predictive expectation. Detailed experimental design and data can be found here. The software proceeds as follows:
1. Graph building. Using heterogeneous ensemble learning, we constructed a causal graph of contributions to enzyme activity, based on enzyme activity data obtained from a dual random point-mutation library (including but not limited to molecular properties and energy landscape data).
2. Confounding factor exclusion. We build a baseline by rearranging the graph relationships and perform LMC and TPa tests on the graph model constructed in the previous step.
3. Multi-objective optimization. Using the model that passes the tests in step 2, the data are recombined and multiple rounds of one-step Bayesian sampling are performed to generate the expectation hotspot map.
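The baseline idea in step 2 can be illustrated with a node-permutation test. The sketch below is a minimal illustration, not the expAscribe implementation: `count_violations` is a hypothetical callback standing in for whatever counts failed LMC or TPa checks on real data, and the toy graph and scoring rule are invented for the demo.

```python
import random


def permute_nodes(edges, nodes, rng):
    """Relabel the nodes at random while keeping the shape of the graph."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    mapping = dict(zip(nodes, shuffled))
    return {(mapping[u], mapping[v]) for u, v in edges}


def permutation_p_value(edges, nodes, count_violations, n_permutations=200, seed=0):
    """Fraction of relabeled graphs that violate no more tests than `edges` does."""
    rng = random.Random(seed)
    observed = count_violations(edges)
    as_good = 0
    for _ in range(n_permutations):
        if count_violations(permute_nodes(edges, nodes, rng)) <= observed:
            as_good += 1
    return as_good / n_permutations


# Toy stand-in for a real test: count edges that are not backed by a
# hypothetical set of dependent variable pairs.
nodes = ["a", "b", "c", "d"]
graph = {("a", "b"), ("b", "c"), ("c", "d")}
supported = {("a", "b"), ("b", "c"), ("c", "d")}


def toy_violations(edges):
    return sum(1 for edge in edges if edge not in supported)


p = permutation_p_value(graph, nodes, toy_violations)
print(f"p-value against the permutation baseline: {p:.3f}")
```

A small p-value means few randomly relabeled graphs fit the data as well as the proposed graph, which is the sense in which the graph "passes" against the baseline.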
Based on this model, we screened a larger virtual mutant library for the samples with the greatest potential for enzyme activity enhancement. The software's predictions were validated experimentally: a mutant with enzyme activity increased by 32% was identified within the predicted region (pocket size of 600 ± 100 or 1200 ± 100, with a channel length of 6 ± 2).
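To make ranking by predictive expectation concrete, here is a minimal sketch assuming an expected-improvement criterion under a Gaussian predictive distribution. The source does not state which acquisition function expAscribe uses, so the criterion, the function name, and the candidate numbers are all illustrative assumptions.

```python
import math


def expected_improvement(mean, std, best):
    """Expected improvement over `best` for a Gaussian prediction N(mean, std^2)."""
    if std <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


# Hypothetical candidates: (label, predicted relative activity, predictive std)
candidates = [("m1", 1.10, 0.05), ("m2", 1.00, 0.30), ("m3", 0.90, 0.02)]
best_observed = 1.00  # wild-type activity used as the reference

ranked = sorted(
    candidates,
    key=lambda c: expected_improvement(c[1], c[2], best_observed),
    reverse=True,
)
for label, mean, std in ranked:
    print(label, round(expected_improvement(mean, std, best_observed), 4))
```

In this toy input the high-uncertainty candidate `m2` ranks above `m1`, which shows how such a criterion trades predicted gain against uncertainty when choosing the next mutant to test.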
Benchmarking
Single cell causal protein-signaling networks discovery
The causal discovery algorithm was evaluated on the classic Sachs protein dataset [2]. The data consist of simultaneous measurements of 11 phosphorylated proteins and phospholipids derived from thousands of individual primary immune system cells, subjected to both general and specific molecular interventions.
Method | Accuracy (%) | Recall (%) | Path Coverage (%) |
---|---|---|---|
PC | 33.05 | 52.63 | 78.98 |
GES | 61.98 | 47.37 | 84.97 |
DAG-GNN | 57.85 | 36.84 | 36.84 |
NOTEARS-MLP | 50.41 | 10.52 | 42.11 |
expAscribe | 71.07 | 42.01 | 100.00 |
expAscribe exceeds the baselines in Accuracy and Path Coverage. In particular, our algorithm recovers 100% of the causal pathways in the Sachs consensus causal network. For more details, see Case Study, Case 2 in the technical manual.
Path Coverage is evaluated by:
$$ Score(G_{true},G_{pred}) = \frac {\sum_{i=1}^{|V|^2} \mathbb{I}_i}{|E(G_{true})|}, $$
where \( \mathbb{I}_i = 1 \) if and only if the \( i \)-th edge in \( G_{true} \) also exists in \( G_{pred} \), and \( \mathbb{I}_i = 0 \) otherwise.
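Read literally, the score is the fraction of true edges that are recovered in the predicted graph. A minimal sketch, assuming both graphs are given as sets of directed (source, target) pairs; the function name and the toy edges are illustrative and not part of expAscribe:

```python
def coverage_score(true_edges, pred_edges):
    """Fraction of edges in the true graph that also appear in the predicted graph."""
    true_edges = set(true_edges)
    pred_edges = set(pred_edges)
    if not true_edges:
        raise ValueError("the true graph must contain at least one edge")
    hits = sum(1 for edge in true_edges if edge in pred_edges)
    return hits / len(true_edges)


# Toy graphs (edges invented for illustration only)
g_true = {("raf", "mek"), ("mek", "erk"), ("pka", "erk"), ("pkc", "raf")}
g_pred = {("raf", "mek"), ("mek", "erk"), ("erk", "akt")}

score = coverage_score(g_true, g_pred)
print(f"coverage: {score:.2%}")
```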
Theory
The theoretical basis, mathematical derivation and technical details of expAscribe can be obtained here
DevTool Chain
The Python library release workflow is built on twine, which provides build-system-independent uploads of source and binary distribution artifacts for both new and existing projects.
The web app is built on Streamlit. We embed HTML, CSS and JavaScript in the framework for custom builds. Streamlit's technical features, especially session state and caching, allow for a more orderly workflow and a more robust data flow in our data science applications.
We containerize the web app with Docker, push the image to Azure Container Registry, and deploy the app on Azure Webservice (SKU B3) located in West Europe, with data stored in Azure StorageV2.
Database Compatibility
The Python library expAscribe aims to broaden the range of data it can process, accepting data from mainstream databases such as GEO and ArrayExpress in addition to data from traditional sources. The expAscribe IO module is designed to be compatible with the GEO "Series Matrix File(s)" format and the ArrayExpress "E-MTAB" format. Current limitations:
1. The IO module cannot parse some data files whose annotations are not standard enough;
2. Parsing is currently only supported in the Python library.
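For orientation, a GEO Series Matrix file is tab-delimited text in which metadata lines start with `!` and the expression table sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` markers. The sketch below shows one way to read that layout with the standard library only. It is not the expAscribe IO module's API; the function names and the sample accessions are hypothetical.

```python
def _to_float(value):
    """Convert a table cell to float, returning None for blank or non-numeric cells."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_series_matrix(text):
    """Split Series Matrix text into (metadata, sample ids, probe -> values)."""
    metadata, table_rows, in_table = {}, [], False
    for line in text.splitlines():
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            in_table = False
        elif in_table:
            table_rows.append([cell.strip('"') for cell in line.split("\t")])
        elif line.startswith("!"):
            fields = [cell.strip('"') for cell in line[1:].split("\t")]
            metadata.setdefault(fields[0], []).extend(fields[1:])
    if not table_rows:
        return metadata, [], {}
    header, body = table_rows[0], table_rows[1:]
    table = {row[0]: [_to_float(v) for v in row[1:]] for row in body}
    return metadata, header[1:], table


# Tiny hand-written sample; titles, accessions and values are made up.
sample = "\n".join([
    '!Series_title\t"Toy series"',
    '!Sample_title\t"ctrl"\t"treated"',
    "!series_matrix_table_begin",
    '"ID_REF"\t"GSM1"\t"GSM2"',
    '"probe_1"\t1.5\t2.5',
    '"probe_2"\t0.1\t',
    "!series_matrix_table_end",
])

metadata, samples, table = parse_series_matrix(sample)
print(samples, table)
```

Blank cells are kept as `None` rather than dropped, so every probe row stays aligned with the sample columns.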
Resource
Source code: GitLab
Installation
Our Python library has been published to PyPI.
pip install expAscribe
The web app source code is available in our GitLab repo. Build and run the app locally, replacing <image-name> with any tag of your choice:
git clone https://gitlab.igem.org/2024/software-tools/zju-china
cd zju-china/web
docker build -t <image-name> .
docker run -p 8501:8501 <image-name>
Team Theme-related
Our software development echoes the team theme, "NeovioDye": we built an app prototype for user-defined picture drawing. With the support of hardware for optically controlled projection, user-defined fashion printing and dyeing patterns were realized.
In our vision, user-defined patterns should not be replicable. At present, we achieve this through public key cryptography[3] and picture steganography, an approach that is rather classical and rigid. In the future, a more innovative and interesting protocol could be built on blockchain NFT (Non-Fungible Token) technology.
Interactive interface, from left to right: the drawing interface, the community photo wall, and the photo unriddle (decoding) interface
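For readers unfamiliar with picture steganography, the sketch below shows least-significant-bit (LSB) embedding on a flat list of 8-bit pixel values. The source does not say which scheme the prototype uses, so LSB, the function names, and the two-byte length prefix are assumptions for illustration. The payload here is plain bytes; signing or encrypting it with a public-key scheme is out of scope.

```python
def embed(pixels, payload):
    """Hide `payload` in the lowest bit of each pixel, after a 2-byte length prefix."""
    data = len(payload).to_bytes(2, "big") + payload
    bits = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("image too small for payload")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit  # overwrite only the lowest bit
    return stego


def extract(pixels):
    """Recover the payload written by `embed`."""
    def read_bytes(start, count):
        out = bytearray()
        for byte_index in range(count):
            value = 0
            for bit_index in range(8):
                value = (value << 1) | (pixels[start + byte_index * 8 + bit_index] & 1)
            out.append(value)
        return bytes(out)

    length = int.from_bytes(read_bytes(0, 2), "big")
    return read_bytes(16, length)


cover = [128] * 200            # stand-in for grayscale pixel data
stego = embed(cover, b"NeovioDye")
print(extract(stego))
```

Each pixel value changes by at most 1, so the carrier picture looks unchanged while still carrying the hidden mark.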
Reference
[1] Kortemme T. De novo protein design—From new structures to programmable functions[J]. Cell, 2024, 187(3): 526-544.
[2] Sachs K, et al. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data[J]. Science, 2005, 308: 523-529. DOI: 10.1126/science.1105809
[3] Rivest R L, Shamir A, Adleman L. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems[J]. Communications of the ACM, 1978, 21(2): 120-126. https://doi.org/10.1145/359340.359342
See a more detailed list of references in the Technical Manual.