EGOAL: a gene Expression prediction model based on Gene Ontology and Abductive Learning

EGOAL is an attempt to bring the Holy Grail of Artificial Intelligence, the combination of Machine Learning and Symbolic Reasoning [1] (also known as Neural-Symbolic AI), into the field of Synthetic Biology.

We aim to assist research and drive innovation in SynBio via Abductive Learning, a novel neural-symbolic AI paradigm proposed by researchers from Nanjing University [1],[2].

Note. In our project, we focus exclusively on Gene Ontology (GO) [3],[4] as the knowledge source. Consequently, we will use the terms "knowledge base", "knowledge graph" and "ontology" interchangeably, as GO is encompassed within each of these, despite the slight differences in terminology.


1. Introduction

[Figure] A simple illustration of our project

Gene expression and regulation pathways are an important subject in synthetic biology. Knowing the genes and pathways involved in a process allows biological mechanisms to be explained. Subsequently, we can regulate these genes via synthetic biology methods, enabling further improvement of the experiments.

While the most common approach to measuring expression is transcriptome sequencing, its high cost and the large amount of data analysis it requires exclude it from many application scenarios.

Therefore, flexible and low-cost AI-based solutions have been explored. Earlier studies considered only traditional machine learning models, including deep learning [5],[6],[7],[8]. Such methods rely heavily on historical data and cannot perform well without large amounts of high-quality data.

We constructed a gene expression prediction model based on Abductive Learning, a prominent neural-symbolic AI paradigm. By leveraging the Gene Ontology, it generates accurate predictions with minimal reliance on historical data, demonstrating the effectiveness of neural-symbolic AI in the field of synthetic biology.

The knowledge base, i.e. the Gene Ontology, encodes gene regulation relations in its graph structure [3],[4], which the model can learn. Therefore, we can accurately predict the expression of genes on a regulation pathway from the conditions and results of an experiment.

2. Software and Code

Our project code is available at github.com/yfxiang0112/EGOAL. Anyone is free to use it to generate predictions for their own experiments.

Note that our project currently works only on the Shewanella oneidensis genome; support for more organisms will be added in future versions.


2.1 Quick Start

git clone git@github.com:yfxiang0112/EGOAL.git
cd EGOAL
pip install -r requirements.txt

To Predict Expression Directly with GO Terms

python src/abl/predict -d True -i examples/NADK/input_terms.txt -o examples/NADK

To Predict Expression with Natural Language Descriptions

python src/abl/predict -i examples/GAPDH/input.txt -o examples/GAPDH

2.2 Graphic Interface

We built a GUI for our project based on PyWebIO. To use it, run:

python src/predict/ui.py


3. Results

Two groups of evaluation in practical applications of our wet lab were performed:

  1. Suggesting improvements to the wet lab, based on an observation of electron transduction and phosphorus absorption.
  2. Giving a possible explanation for the improvement above, based on genes with a proven effect on electricity generation.

3.1 Finding Genes of Both Phosphorus Metabolism and Electricity Generation

A relation between phosphorus metabolism and electricity generation was noticed in earlier wet labs: electron transduction and electricity generation increase with phosphorus concentration while the concentration is relatively low, but are suddenly inhibited once it becomes high enough.

We tested this observation with our AI model, and the result shows considerable significance for the genes NADK (SO_1523, SO_3077), PPX (SO_2185) and PAP (SO_3975). This is consistent with our improvement strategy of transforming enzymes in the phosphorus hydrolysis pathway (including PPX, PPK, PAP and NADK) and constructing engineered bacteria, indicating that our model can genuinely inform the design and improvement of wet lab experiments.

[Figure] Prediction confidence of NADK (SO_1523, SO_3077), PPX (SO_2185) and PAP (SO_3975).


3.2 Helping to Explain and Understand Experiment Observations

Several genes that have been shown to be involved in the electricity generation process of Shewanella [9] appear in the prediction results, indicating that these genes might be regulated by the enzymes we used to construct the engineered bacteria in the wet lab.

Our model predicts a changed expression level of crp in engineered bacteria transformed with PPK1 and PPK2. Changed expression of the NADH-related enzymes pflB, fdhD and fdhE and of the membrane protein MtrB is shown in all experiment groups (NADK, PAP, PPK1, PPK2 and PPX).

[Figures] Classification confidence of 5 wet lab groups (engineered bacteria transformed with NADK, PAP, PPK1, PPK2, PPX) and 1 control gene GAPDH (not used in the actual experiment)

Earlier studies have found that proteorhodopsin and cAMP-responsive catabolite regulators, regulated by the gene crp (SO_0624) in the Shewanella genome, affect electricity production. The crp gene has also been shown to regulate lactate intake and utilization; lactate is the major carbon source in our wet labs.

The NAD+-dependent enzyme pflB and the fdh family are also known as regulators of NADH synthesis and electricity output. MtrB is likewise involved in electricity generation.

Based on these results, we can propose the hypothesis that phosphorus hydrolysis improves electricity generation via the following mechanisms:

  1. increasing carbon metabolism and energy utilization;
  2. influencing the synthesis of NADH from NAD+.

[Figures] Predicted regulation pathways of 5 wet lab groups (engineered bacteria transformed with NADK, PAP, PPK1, PPK2, PPX)


4. Neural-Symbolic AI and Abductive Learning

4.1 Neural-Symbolic AI: Bridging Learning and Reasoning

Neural-symbolic AI is a type of artificial intelligence that integrates machine learning (neural AI) with knowledge reasoning (symbolic AI) to leverage the strengths and mitigate the weaknesses of each approach.

Machine learning emulates the learning and intuitive prediction capabilities of human intelligence. It uses probabilistic and statistical methods to model datasets with probability distributions, making predictions based on these assumed distributions. In contrast, knowledge reasoning involves rigorous logical inference based on formal logics, accurately reflecting the algebraic structure of a system derived from high-level domain knowledge or rules.

By bridging both neural and symbolic systems, neural-symbolic AI enables the creation of more powerful and accurate models. In the context of biology, for example, probabilistic learning can uncover patterns from empirical data and explore possible mechanisms and explanations, while logical reasoning ensures that the results strictly adhere to universal biological principles.


4.2 Abductive Learning: a Novel Neural-Symbolic Framework

Abductive Learning (ABL) [1],[2] is a prominent neural-symbolic AI paradigm that integrates machine learning with knowledge reasoning. ABL leverages logic abduction to incorporate knowledge, formalized as First-Order Logic (FOL) rules, into machine learning systems.

Abduction is a form of logical inference that seeks the most plausible underlying facts as explanations for observations. For instance, in biological applications, it can identify the most reasonable explanations for observed results in a wet lab.

Abductive learning is an iterative algorithm comprising both a learning and a reasoning phase. In the learning phase, a base model is trained on a dataset and predicts the outcomes of unseen data samples as pseudo-labels. In the reasoning phase, the reasoner corrects these pseudo-labels to ensure their logical consistency, based on the Gene Ontology knowledge base, and the base model is retrained with the revised labels.

Formally, abductive learning trains a function $f$ such that:

$$ \mathcal{KB} \cup \Delta(\mathcal{O}) \nvDash \bot $$

where $\mathcal{O}$ denotes the pseudo-groundings implied by the labeled data $\boldsymbol{X}$ and $f(\boldsymbol{X})$, and $\Delta(\mathcal{O})$ denotes the revised pseudo-groundings that are consistent with the knowledge base $\mathcal{KB}$.
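
In code, this loop can be sketched as follows. This is a minimal illustration, not the project's actual implementation: base_model stands for any classifier with fit/predict, and revise is a hypothetical abduction routine returning pseudo-labels consistent with the knowledge base.

def abductive_learning(base_model, revise, knowledge_base,
                       X_labeled, y_labeled, X_unlabeled, n_iter=5):
    # Initial learning phase: fit the base model on the labeled data.
    base_model.fit(X_labeled, y_labeled)
    for _ in range(n_iter):
        # Learning phase: predict pseudo-labels O for the unseen samples.
        pseudo = base_model.predict(X_unlabeled)
        # Reasoning phase: abduce revised pseudo-labels Delta(O)
        # that are consistent with the knowledge base KB.
        revised = [revise(x, y, knowledge_base)
                   for x, y in zip(X_unlabeled, pseudo)]
        # Retrain the base model with the revised labels.
        base_model.fit(list(X_labeled) + list(X_unlabeled),
                       list(y_labeled) + revised)
    return base_model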

A figure illustrating this process (figure by Zhou et al.) [1],[2]:

[Figure] The abductive learning process

5. Methods

5.1 Framework

EGOAL is a gene expression prediction model, trained on the Gene Ontology (GO) knowledge base and a dataset collected from the Gene Expression Omnibus (GEO). Our work consists of three parts:

  1. construction of dataset;
  2. preprocessing of knowledge base;
  3. training on dataset and knowledge base with abductive learning.

A labeled dataset was collected from the NCBI Gene Expression Omnibus (GEO) database [10]. Entity alignment based on natural language embedding is performed to convert the experiment descriptions in GEO into knowledge representations of the knowledge base.

To convert the large-scale knowledge graph into a rule set that is feasible and tractable for training, a series of preprocessing steps is applied to the GO knowledge base, including subgraph extraction, rule mining and the Remembering algorithm [11].

The neural-symbolic model is built under the abductive learning framework [12]. To cope with the vast output space of predicting over the Shewanella genome, the model is partitioned per gene.


5.2 Constructing Dataset from GEO Sample Data

We constructed our dataset from the gene expression data of past experiments, obtaining 40 expression series and 426 samples of Shewanella MR-1 from the NCBI Gene Expression Omnibus (GEO) database [10].

Unifying Numerical Values of Expression Data

Since different sequencing platforms were employed across data samples, numerical values may vary between samples. To solve this problem, we applied differentially expressed gene (DEG) analysis with the library PyDESeq2 [13] to unify the gene expression of data entries as over-expression, under-expression or not changed, according to their log2 fold change values.
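
As an illustration, such a DEG pass might look like the sketch below. It assumes the PyDESeq2 0.4-style API, a raw-count matrix with a two-level condition factor, and an illustrative log2 fold change cutoff of 1.0; the file names are assumptions.

import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# counts: samples x genes matrix of raw read counts;
# metadata: per-sample table with a "condition" column (e.g. treated/control).
counts = pd.read_csv("counts.csv", index_col=0)
metadata = pd.read_csv("metadata.csv", index_col=0)

dds = DeseqDataSet(counts=counts, metadata=metadata, design_factors="condition")
dds.deseq2()          # fit dispersions and log fold changes
stats = DeseqStats(dds)
stats.summary()       # populates stats.results_df

def discretize(lfc, threshold=1.0):  # the threshold is illustrative
    if lfc >= threshold:
        return "over-expression"
    if lfc <= -threshold:
        return "under-expression"
    return "not changed"

labels = stats.results_df["log2FoldChange"].map(discretize)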

Entity Alignment with Natural Language Embedding

To predict gene expression from experiment conditions and results, natural language inputs are involved. To exploit the knowledge rules in an ontology (GO in our project), entities described in natural language must be aligned to their corresponding knowledge representations in the ontology (GO terms in our project).

We employed the transformer-based sentence embedding tool sBERT [14], which converts text into real-valued vectors. With data samples embedded as vectors according to their experiment descriptions, and GO terms embedded according to their annotations, we can compare the similarity between GO terms and data samples and select a set of GO terms to represent each data sample.
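
A minimal sketch of this alignment with the sentence-transformers library is shown below; the model name and the toy GO annotations are assumptions for illustration.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption
go_annotations = {                               # GO term -> annotation text (toy)
    "GO:0006096": "glycolytic process",
    "GO:0019646": "aerobic electron transport chain",
}
sample_text = "anaerobic growth with lactate as electron donor"

term_ids = list(go_annotations)
term_vecs = model.encode(list(go_annotations.values()), convert_to_tensor=True)
sample_vec = model.encode(sample_text, convert_to_tensor=True)

# Cosine similarity between the sample and every GO term; keep the top-k terms.
scores = util.cos_sim(sample_vec, term_vecs)[0]
top_terms = sorted(zip(term_ids, scores.tolist()), key=lambda t: -t[1])[:5]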

[Figure] Embedded vectors of GO terms (blue) and data instances (red)


5.3 Knowledge Graph Preprocessing: Remembering

Knowledge graphs (KGs), such as GO, often encompass extensive information, much of which may be irrelevant to specific learning tasks, such as predicting gene expression in the Shewanella genome. To enable efficient reasoning and abductive learning on these large-scale KGs, we implemented a series of preprocessing methods to extract logical rules and filter out irrelevant information.

Three main procedures are involved in preparing the knowledge graph for abductive learning:

Subgraph Extraction

Given the comprehensive nature of large-scale knowledge graphs like GO, only a subset relevant to the Shewanella genome should be utilized for model training.

To exclude irrelevant sections of the ontology, we retained only the pertinent GO terms—specifically, those present in our dataset and in the GO annotations for the Shewanella genome—as well as terms closely related to them.
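
A sketch of such an extraction with networkx is given below; treating terms within a fixed graph radius of the seed terms as "closely related" is our simplifying assumption.

import networkx as nx

def extract_subgraph(go: nx.DiGraph, seed_terms, radius=2):
    # go: directed graph of GO terms (edges are is_a / regulates relations);
    # seed_terms: terms found in the dataset or in Shewanella GO annotations.
    keep = set(seed_terms)
    frontier = set(seed_terms)
    for _ in range(radius):          # expand to closely related terms
        nxt = set()
        for term in frontier:
            nxt.update(go.predecessors(term))
            nxt.update(go.successors(term))
        frontier = nxt - keep
        keep |= nxt
    return go.subgraph(keep).copy()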

Rule Mining

Abductive learning requires first-order logic (FOL) rules for reasoning, while an OWL-formatted knowledge graph consists of RDF triplets (subject, predicate, object). To convert these triplets into FOL rules, we apply rule mining based on the natural semantics of the predicates within the knowledge graph.

In our project, we unified the set of rules as conjunctions of implications, represented as

$$ (C_1 \rightarrow D_1) \wedge (C_2 \rightarrow D_2) \wedge \cdots \wedge (C_n \rightarrow D_n) $$

where each $C_i$ and $D_i$ is either $A$ or $\neg A$ for some GO term $A$.
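
For illustration, a minimal rdflib sketch that reads is_a triplets as implications could look like this; the file path is an assumption, and other GO relations (e.g. regulates, encoded via owl:Restriction blank nodes) would need their own mappings.

from rdflib import Graph, BNode
from rdflib.namespace import RDFS

g = Graph()
g.parse("go.owl")  # path to the GO OWL file (assumption)

# A GO term's is_a parent appears as rdfs:subClassOf between named classes;
# read each such triplet (C, subClassOf, D) as the implication C -> D.
rules = [
    (str(c), str(d))
    for c, _, d in g.triples((None, RDFS.subClassOf, None))
    if not isinstance(c, BNode) and not isinstance(d, BNode)
]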

The Remembering Algorithm

The Remembering algorithm, proposed by Huang et al. [11], enhances the efficiency of abductive learning by refining a set of First-Order Logic (FOL) rules based on an assumption of transitivity. This process ensures that the rule set fully preserves the original information while retaining only the desired terms.

The formal description is as follows:

[Figure] Formal description of the Remembering algorithm [11]

5.4 Abductive Learning

Basic Abductive Learning Framework: ABLkit

ABLkit is a highly flexible toolkit designed to support the primary workflows of Abductive Learning [12].

For this project, we developed an ABL framework using ABLkit. This involved constructing the learning component with a foundational machine learning model, developing the reasoner and specifying its logical computations based on predefined rules, and integrating the learning and reasoning components to enable abductive learning.
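
A sketch of this wiring, following ABLkit's documented interfaces, might look as follows; the knowledge base here is a toy stand-in for the rule sets mined from GO.

from sklearn.ensemble import RandomForestClassifier
from ablkit.learning import ABLModel
from ablkit.reasoning import KBBase, Reasoner
from ablkit.bridge import SimpleBridge
from ablkit.data.evaluation import SymbolAccuracy

class ExpressionKB(KBBase):
    # Toy KB: pseudo-labels are per-gene expression classes.
    def __init__(self):
        super().__init__(pseudo_label_list=["under", "unchanged", "over"])
    def logic_forward(self, pseudo_labels):
        # Toy forward semantics: how many genes changed expression at all;
        # the real logic comes from the GO rule sets of Section 5.3.
        return sum(label != "unchanged" for label in pseudo_labels)

kb = ExpressionKB()
reasoner = Reasoner(kb)                     # revises conflicting pseudo-labels
model = ABLModel(RandomForestClassifier())  # the learning component
bridge = SimpleBridge(model, reasoner, [SymbolAccuracy()])
# bridge.train(train_data, loops=3)         # alternate learning and reasoning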

Handling Multi-label Classification

Identifying regulated genes among the more than 4,000 genes in the Shewanella genome presents a challenging learning task, characterized by a vast search space of $2^{\text{genome-size}}$ when using a single classifier model.

Considering the extensive output space, we partitioned the task and developed a binary classifier for each gene to predict its variation in expression levels. By dividing the prediction task, we can employ simpler learning models for each subtask.

We also adapted the Remembering algorithm for the segmented model. For each subtask, we devised a rule set that includes only terms from the dataset and the corresponding gene's GO annotation, thereby enhancing the efficiency of training.

Given that the subtasks are largely data-independent, the degree of concurrency can be significantly increased, allowing for the models to be trained using parallel computation methods. This approach has demonstrated a substantial improvement in training time efficiency, reducing the duration from 27 hours to 30-40 minutes in practical evaluations.
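
A minimal sketch of this partitioned, parallel training with joblib is shown below, using toy data in place of the real features and labels from Sections 5.2 and 5.3.

import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((426, 300))                # toy: 426 samples x 300 GO-term features
Y = rng.integers(0, 2, size=(426, 4000))  # toy: binary change labels per gene

def train_one_gene(X, y_gene):
    # One independent binary classifier per gene (one subtask).
    return LogisticRegression(max_iter=1000).fit(X, y_gene)

# Subtasks share no state, so they parallelize across all available cores.
classifiers = Parallel(n_jobs=-1)(
    delayed(train_one_gene)(X, Y[:, g]) for g in range(Y.shape[1])
)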

Revising pseudo-labels by abduction

For each subtask involving a specific gene, an ABL model is trained on a specially constructed rule set, where pseudo-labels are evaluated based on the number of logic rules they violate. Given a pair of inputs and their corresponding pseudo-labels, we use the weighted count of violated rules (i.e., those evaluated as "False" with respect to the pair) to identify conflicting pseudo-labels. Logic abduction is then employed to identify possible labels that conform to the knowledge base. The revised labels are selected by maximizing the consistency between the new labels and the knowledge base.
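
The scoring and revision step can be sketched as follows; representing rules as weighted implications over boolean truth assignments is a simplification of the actual rule sets.

def violation_score(assignment, rules):
    # Weighted count of implication rules (C, D, weight) evaluated as
    # False under the truth assignment {literal: bool}.
    return sum(w for c, d, w in rules
               if assignment.get(c, False) and not assignment.get(d, False))

def abduce(assignment, rules, candidates):
    # Choose the candidate revision that minimizes violated rules,
    # i.e. maximizes consistency with the knowledge base.
    return min(candidates,
               key=lambda cand: violation_score({**assignment, **cand}, rules))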

A formal description is as follows:

[Figure] Formal description of pseudo-label revision by abduction

Exploiting Unlabeled Data

As a semi-supervised method, an ABL model can be trained using an unlabeled dataset [16]. The machine learning component, or base model, can predict a pseudo-label for an unlabeled input, which is then revised by the reasoning component. This revised pseudo-label is subsequently used to retrain the base model.

To compensate for the lack of labeled Shewanella experimental data, we collected experimental descriptions of E. coli, which cover similar experiments but provide expression data on a different genome. Approximately 3,000 unlabeled data samples were obtained from GEO. This approach resulted in a considerable improvement in model performance, as evidenced by both increased accuracy and enhanced practical applications.


5.5 Test and Evaluation

A substantial improvement in test dataset accuracy is observed after employing abductive learning, in comparison to the base model trained only on the labeled dataset.

[Figure] Learning component accuracy of all subtask models

[Figure] Reasoning component accuracy of all subtask models

We also constructed a baseline model based on deep learning and knowledge embedding. In detail, we used the knowledge base embedding tool OWL2Vec* [15] to convert the knowledge representations in data samples into vectors, and trained a three-layer 1553x1553 deep neural network on the embedding vectors.
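
For reference, a minimal PyTorch sketch of such a network is given below; the fully connected form and the uniform 1553-wide layers are our reading of the description above, and details of the original baseline may differ.

import torch.nn as nn

# OWL2Vec* embedding vectors in, per-gene expression-change logits out.
baseline = nn.Sequential(
    nn.Linear(1553, 1553), nn.ReLU(),
    nn.Linear(1553, 1553), nn.ReLU(),
    nn.Linear(1553, 1553),
)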

In evaluation on the test dataset, abductive learning significantly outperformed the embedding-based method, demonstrating the effectiveness of our approach in comparison to traditional deep learning.

[Figure] Accuracy of the embedding and neural network-based method

References

[1]: Zhou Z. H., Abductive learning: Towards bridging machine learning and logical reasoning, Science China Information Sciences, 2019.

[2]: Zhou Z. H., Huang Y. X., Abductive learning, in P. Hitzler and M. K. Sarker eds., Neuro-Symbolic Artificial Intelligence: The State of the Art, 2022.

[3]: The Gene Ontology Consortium, The Gene Ontology project in 2008, Nucleic Acids Research. 36, 2008.

[4]: Dessimoz C., Škunca N., eds., The Gene Ontology Handbook, Methods in Molecular Biology, Vol. 1446, 2017.

[5]: Al taweraqi N., King R. D., Improved prediction of gene expression through integrating cell signaling models with machine learning, BMC Bioinformatics, 2022.

[6]: Chen Y., Li Y., Narayan R., Subramanian A., Xie X., Gene expression inference with deep learning, Bioinformatics, 32(12), 2016, 1832-1839.

[7]: Beer M. A., Tavazoie S., Predicting Gene Expression from Sequence, Cell, Vol. 117, 185-198, April 16, 2004.

[8]: Singh R., Lanchantin J., Robins G., Qi Y., DeepChrome: deep-learning for predicting gene expression from histone modifications, Bioinformatics, 32, 2016, i639-i648.

[9]: Zhang J., Li F., et al., Engineering extracellular electron transfer pathways of electroactive microorganisms by synthetic biology for energy and chemicals production, Chem. Soc. Rev., 53, 2024, 1375-1446.

[10]: Edgar R., Domrachev M., Lash A. E., Gene Expression Omnibus: NCBI gene expression and hybridization array data repository, Nucleic Acids Research. 30 (1): 207-10, 2002.

[11]: Huang Y., et al., Enabling Abductive Learning to Exploit Knowledge Graph, Proceedings of the 32nd International Joint Conference on Artificial Intelligence, 2023.

[12]: Huang Y. X., Hu W. C., Gao E. H., Jiang Y., ABLkit: A Python Toolkit for Abductive Learning. Frontiers of Computer Science, 2024.

[13]: Muzellec B., Telenczuk M., Cabeli V., Andreux M., PyDESeq2: a python package for bulk RNA-seq differential expression analysis, Bioinformatics, 2023.

[14]: Reimers N., Gurevych I., Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.

[15]: Chen J., Hu P., Jimenez-Ruiz E., et al., OWL2Vec*: embedding of OWL ontologies. Mach Learn, 2021.

[16]: Zhou Z. H., Machine Learning, Tsinghua University Press, 2016. (In Chinese.)