Short Description
This year, our project focuses on developing a model capable of identifying the relationship between diseases and proteins to pinpoint potential biomarkers of diseases. Our approach utilizes a pre-trained Natural Language Processing (NLP) model, which draws on a database comprising notable publications from leading journals such as Nature, Science, and Cell. The model systematically reviews all articles within this database to identify and label significant associations between proteins and diseases.
Building on this model, we developed a software tool named "Artemis’ Eye." This application gives users easy access to these potential biomarkers for ongoing research and development. Our tool aims to facilitate advanced studies of disease mechanisms and may assist in the discovery of new therapeutic targets.
Furthermore, we aim to go beyond a mere "search engine" for relationships between proteins and diseases. To that end, we incorporated generative artificial intelligence (generative AI) to design short peptides capable of binding to potential markers. These peptides are intended for use in our detection system.
Flowchart
Database establishment
Limits to data
Given the substantial volume of data, the time and space complexity of running the model on a local device becomes prohibitive. To mitigate this issue, we imposed certain constraints. From prior analysis, it was observed that the majority of protein sequences in the database are under 2,000 amino acids (A.A.) in length. Consequently, we excluded any sequences exceeding this length.
Similarly, for the disease database, we determined that symptoms could not serve as reliable identifiers because many diseases share the same symptom (see the Engineering page for details). Therefore, we excluded most symptom data from our analysis.
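The sequence-length constraint described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual pipeline: the record layout (an entry ID paired with a sequence string) and the example entries are assumptions made for the demonstration, and only the 2,000 A.A. cutoff comes from the text.

```python
MAX_LENGTH = 2000  # amino acids; the cutoff stated in the text


def filter_by_length(records, max_length=MAX_LENGTH):
    """Keep (entry_id, sequence) pairs whose sequence does not exceed max_length."""
    return [(entry_id, seq) for entry_id, seq in records if len(seq) <= max_length]


# Hypothetical toy records; real entries would come from the UniProt download.
records = [
    ("ENTRY_1", "MKT" * 10),   # 30 residues, kept
    ("ENTRY_2", "A" * 2000),   # exactly 2,000 residues, kept
    ("ENTRY_3", "A" * 2001),   # 2,001 residues, excluded
]
kept = filter_by_length(records)
print([entry_id for entry_id, _ in kept])
```

Filtering before any further processing keeps the largest sequences from dominating memory use and running time on a local device.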
Data collection method
1. Database description
To analyze these relationships, we need two distinct databases: one for proteins and another for diseases. By linking the data from these two databases, we create a network in which the relationships between them can be identified and analyzed.
2. Protein database establishment
For the protein database, we collected data from UniProt, focusing specifically on the reviewed entries. This ensures that the information we use has been manually curated by expert biologists, providing a higher level of reliability and accuracy.
3. Disease database establishment
For the disease database, we reviewed multiple databases that store symptoms and disease names, including both vernacular and scientific nomenclature. After compiling this data, we removed duplicate names to ensure the accuracy and integrity of our dataset.
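Duplicate removal of this kind can be sketched as follows. The normalization rule (ignore case and extra whitespace) is an assumption chosen for illustration; the text does not state how the team decided that two names were duplicates, and this sketch does not merge a vernacular name with its scientific synonym.

```python
def deduplicate(names):
    """Return names with duplicates removed, keeping the first spelling seen.

    Two names count as duplicates when they match after collapsing
    whitespace and ignoring case (an assumed rule, see the lead-in).
    """
    seen = set()
    unique = []
    for name in names:
        clean = " ".join(name.split())  # trim and collapse internal whitespace
        key = clean.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(clean)
    return unique


# Hypothetical input compiled from several sources.
names = ["Leukemia", "leukemia ", "Acute  Myeloid Leukemia", "acute myeloid leukemia", "Edema"]
print(deduplicate(names))
```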
Information on the two databases
Protein database
1. Amount
We gathered 234,484 entries from UniProt and stored them in the database.
2. Length
The following figure shows the length distribution of all entries whose sequences are under 2,000 A.A.
Disease database
For the disease database, we gathered information from a variety of online sources, with the majority of the data sourced from the Mayo Clinic.[1] A total of 1,150 diseases were included in the creation of the network used for analysis.
Connection establishment
Short description of the NLP model database
This pre-trained NLP model is developed using data from two key sources. The first source includes several renowned international journals, such as Nature, Science, and others, which are used to identify relationships between proteins and diseases. The second source consists of statistical analyses of commonly used sentences in everyday language, which aid in evaluating sentence semantics. The journal database includes all papers published in these journals up to January 2024. Additionally, papers available through the NCBI repository are incorporated as part of the data resources.
Limits on the network established between proteins and diseases
Our primary objective is to identify relationships between diseases and proteins. To alleviate the computational burden posed by two large databases, we structured the relationship network as a bipartite graph. In this structure, one disjoint set represents proteins, while the other corresponds to diseases. This approach reduces computational complexity by eliminating the need to calculate relationships between proteins or between diseases themselves, focusing solely on the connections between the two distinct sets.
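A bipartite structure like the one described above could be represented as shown in this sketch. The class name, method names, and all example names and scores are hypothetical and serve only to illustrate the idea; the text does not describe how the team stored the network.

```python
from collections import defaultdict


class BipartiteNetwork:
    """Protein-disease network in which edges only cross between the two sets."""

    def __init__(self, proteins, diseases):
        self.proteins = set(proteins)
        self.diseases = set(diseases)
        self.edges = defaultdict(dict)  # protein -> {disease: score}

    def add_edge(self, protein, disease, score):
        # Reject protein-protein and disease-disease edges by construction.
        if protein not in self.proteins or disease not in self.diseases:
            raise ValueError("an edge must join a known protein to a known disease")
        self.edges[protein][disease] = score

    def proteins_for(self, disease):
        """Return (protein, score) pairs linked to a disease, highest score first."""
        hits = [(p, links[disease]) for p, links in self.edges.items() if disease in links]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical names and scores, for illustration only.
network = BipartiteNetwork(proteins=["A", "B"], diseases=["alpha", "beta"])
network.add_edge("A", "alpha", 2.0)
network.add_edge("B", "alpha", 0.5)
network.add_edge("B", "beta", 1.0)
print(network.proteins_for("alpha"))
```

With P proteins and D diseases, only the P × D cross pairs are candidates for an edge, so pairs within the protein set or within the disease set are never evaluated.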
How the model finds relationships in human language
1. The sentence semantic analysis
In the NLP model, two key tasks are performed: determining the location of relevant content and analyzing whether the content is positive or negative. The content refers to the sentences in the paper that mention the marker, while the positive or negative classification indicates whether the sentence suggests that the marker contributes to the disease or helps reduce its effects.
Initially, the model identifies the specific protein in the paper using information such as the entry ID, protein name, and the organism. Once the protein is located, the model analyzes the context of the sentence based on statistical results. For example, in the sentence, "The A protein is overexpressed when the α disease occurs," the model would classify this as positive data, while sentences indicating a reduction of the disease would be classified as negative data. The model also takes into account the arrangement of semantic components, such as Subject, Verb, and Noun, within the sentence. By leveraging these statistical analyses, we can assess the frequency of highly positive or highly negative sentences discussing proteins and diseases, thus identifying strong relationships between specific proteins and diseases.
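The two tasks, locating a relevant sentence and labeling it positive or negative, can be illustrated with a deliberately simple keyword rule. This is a stand-in and not the pre-trained NLP model: the cue-word lists are hypothetical, and the real model relies on statistical analysis of sentence structure rather than fixed keywords.

```python
import re

# Hypothetical cue words; the team's model and its statistics are not reproduced here.
POSITIVE_CUES = {"overexpressed", "upregulated", "promotes", "drives"}
NEGATIVE_CUES = {"suppresses", "inhibits", "reduces", "protects"}


def mentions(name, sentence):
    """True if name occurs in the sentence as a whole word, ignoring case."""
    return re.search(r"\b" + re.escape(name) + r"\b", sentence, re.IGNORECASE) is not None


def classify_sentence(sentence, protein, disease):
    """Return +1 (positive), -1 (negative), or 0 (not relevant or undecided)."""
    if not (mentions(protein, sentence) and mentions(disease, sentence)):
        return 0  # the sentence does not link this protein to this disease
    words = set(re.findall(r"[\w-]+", sentence.lower()))
    has_positive = bool(words & POSITIVE_CUES)
    has_negative = bool(words & NEGATIVE_CUES)
    # Conflicting cues cancel out and leave the sentence undecided.
    return int(has_positive) - int(has_negative)


example = "The A protein is overexpressed when the α disease occurs"
print(classify_sentence(example, "A", "α"))
```

The labels produced per sentence are the raw material for the mention counts used later in the scoring step.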
2. The table analysis
In addition to sentence semantic analysis, another critical component of formal papers is the data presented in tables. Many papers include percentages in tables to indicate the degree of relationship between variables. In this model, we categorize these percentages into four groups for interpretation: (75%, 100%] is classified as very highly related, (50%, 75%] as highly related, (25%, 50%] as not related, and [0%, 25%] as very not related. Additionally, the sentiment—positive or negative—is determined by analyzing the sentences linked to the table data. By combining these categorizations with the table references, we can perform deeper analyses on the papers within our database.
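The four percentage intervals can be expressed as a small lookup function. The function name is hypothetical; the interval boundaries and labels are taken directly from the paragraph above, with each upper bound included in its own group.

```python
def relation_category(percent):
    """Map a table percentage (0 to 100) to one of the four relation groups."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must lie between 0 and 100")
    if percent > 75:
        return "very highly related"   # (75%, 100%]
    if percent > 50:
        return "highly related"        # (50%, 75%]
    if percent > 25:
        return "not related"           # (25%, 50%]
    return "very not related"          # [0%, 25%]


for value in (100, 75, 50.5, 25, 0):
    print(value, relation_category(value))
```

Checking the thresholds from highest to lowest means a boundary value such as 75% falls into the lower group, matching the half-open intervals in the text.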
3. The scoring method
The relationship score between diseases and proteins is determined by the frequency of protein mentions in research papers. For each paper, we calculate the number of positive and negative mentions of specific proteins. After reviewing the entire paper, we subtract the number of negative mentions from the positive mentions to obtain the net score for that paper. For example, if a protein is mentioned positively 4 times and negatively 2 times, its final score for that paper is 2. Since the number of papers referencing each protein varies, we calculate an average score to ensure that the final score is not skewed by uneven representation.
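The scoring rule can be written out as a short sketch. The function names and the input layout (one pair of positive and negative counts per paper) are assumptions for illustration, as is the choice to return 0.0 for a protein with no papers; the arithmetic follows the description above.

```python
from statistics import mean


def paper_score(positive, negative):
    """Net score of one paper: positive mentions minus negative mentions."""
    return positive - negative


def relationship_score(mentions_per_paper):
    """Average net score over all papers that mention the protein.

    mentions_per_paper is a list of (positive, negative) counts, one per paper.
    """
    if not mentions_per_paper:
        return 0.0  # assumed convention for a protein that no paper mentions
    return float(mean(paper_score(p, n) for p, n in mentions_per_paper))


# The example from the text: 4 positive and 2 negative mentions in one paper.
print(paper_score(4, 2))
# Hypothetical counts for a protein mentioned in two papers.
print(relationship_score([(4, 2), (5, 1)]))
```

Averaging rather than summing keeps a heavily studied protein from outranking a less studied one purely because more papers mention it.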
Peptide generation tools and the preliminary verification method
We submitted a protein PDB file to PeptiMap[3] and specified the desired interacting chain. The tool then calculated peptide sequences that would interact with the selected protein chain.
The PeptiMap process includes:
‧ Exploration of the protein surface: Fourier-based mapping docks small probe molecules to the protein surface to locate potential peptide binding sites.[4][5]
‧ Refinement: Docked positions are refined to identify clusters representing potential binding sites.[4][5]
‧ Clustering and ranking: Probe clusters are ranked by energy scores and peptide binding sites are identified within the top-ranked clusters.[4][5]
‧ Site filtering: Internal or buried sites inaccessible to peptides are excluded, focusing only on surface accessible regions.[4][5]
‧ Evaluation and validation: PeptiMap has been tested on known peptide-protein complexes, demonstrating high accuracy in identifying peptide binding sites.[4][5]
After generating the peptide sequences, we used PEP-FOLD 4.0[6] from the RPBS web portal to predict the most likely peptide structures. The optimal structures were selected for further investigation.
To verify the binding interaction between the peptides and our biomarker, we used iGEMDOCK[7] for docking. The peptide structures and the biomarker structures were used as inputs, resulting in a PDB file of the peptide docking and a table detailing the interactions between specific residues of the peptide and the protein, which facilitated subsequent evaluations.
Relationship analysis of the generated peptides and a real-world protein
To ensure the validity of the generated peptides, we performed further analysis on their sequences and structures. For sequence validation, we performed a local alignment between the peptides and CD55 sequences using the BLOSUM62 scoring matrix. For structural validation, we used RCSB PDB's Pairwise Structure Alignment tool[8] to analyze the relative positions of the aligned structures.
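Local alignment of this kind is commonly computed with the Smith-Waterman algorithm, sketched below. Two simplifications are assumptions of this sketch and not the team's settings: a plain match/mismatch score stands in for the BLOSUM62 matrix, and a linear gap penalty is used. The demonstration aligns two of the peptide sequences listed on this page with each other rather than with CD55, whose sequence is not reproduced here.

```python
def simple_score(x, y):
    """Stand-in for BLOSUM62: +2 for identical residues, -1 otherwise."""
    return 2 if x == y else -1


def smith_waterman(a, b, score=simple_score, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # dynamic-programming score matrix
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,                                          # start a new alignment
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                H[i - 1][j] + gap,                          # gap in b
                H[i][j - 1] + gap,                          # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Peptides 2BOU-3 and 2BOU-1 from this page, used only as demonstration input.
print(smith_waterman("DWVCSP", "VCSPGYEPAKVEQGSYK"))
```

Passing a different `score` function, for example one that looks values up in a BLOSUM62 table, changes the scoring without touching the alignment logic.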
The top 10 most highly related diseases in the database
Analysis of the top 10 and the final choice
From Figure 4, we observe the top 10 most frequently mentioned diseases in the database. However, several entries in the list are not disease names but symptoms, such as dehydration and edema. Additionally, some diseases cover a broad spectrum, making it difficult to identify specific underlying markers. After team discussions, we decided to exclude diseases with these issues and selected leukemia as the final focus for further research.
The underlying marker we found
After identifying the diseases of interest, we compiled a list of related proteins. Our primary objective is to develop a platform capable of rapid detection. To achieve this, we prioritized markers that can be easily detected from common samples, such as blood or saliva. Ultimately, CD97 emerged as the most highly associated protein, serving as a key underlying marker. Furthermore, we discovered that CD97, when combined with established leukemia markers such as TP53 and RUNX1[9], enhances the precision of leukemia detection.
The generated short peptides
The sequences of three generated peptides
First, we generated peptide sequences based on the CD97 structure. PeptiMap produced the following peptide sequences:
‧ 2BOU-1: VCSPGYEPAKVEQGSYK
‧ 2BOU-2: RCNPGFSSSEIII
‧ 2BOU-3: DWVCSP
The structures of three peptides generated by PEP-FOLD 4.0
Due to software limitations, 2BOU-1 could not be docked, but the docking results for 2BOU-2 and 2BOU-3 are provided. In Figures 7 and 8, CD97 is shown in green, the peptides in blue, and the interacting residues of CD97 in dark green.
According to the iGEMDOCK results, 2BOU-2 interacts with CD97 at Ser-59, Asp-60, Trp-62, Val-71, Cys-72, Ser-73, Pro-74, Gly-75, Tyr-76, Glu-77, Pro-78, Lys-83, Val-96, Gln-103, while 2BOU-3 interacts with residues such as Cys-47, Ala-48, Ser-54, Cys-55, Ser-59, Asp-60, Cys-61, Trp-62. As 2BOU-1 had no docking results, it was excluded from further study.
Relationship between the generated short peptides and a real-world protein
Melania Capasso, Lindy G. Durrant, Martin Stacey, and colleagues reported that CD55 binds CD97.[13] We therefore used CD55 as the reference for comparison with our results.
The results of the local alignment are shown in the following figures, together with the result of the pairwise structure alignment. 2BOU-3 did not meet the minimum peptide length required for alignment, so there is no result for 2BOU-3.
Furthermore, our research revealed that the binding of CD97 and CD55 is strongly associated with residues 1-4, 48-54, 77-83, 124-127, 130-137 on CD97 and residues 5-12, 40-45, 56-61, 67-74, 78-91, 98-109, 114-125 on CD55. The docking and Pairwise Structure Alignment results for 2BOU-2 show a high degree of overlap with the reported binding sites, increasing confidence in its interaction with CD97.[14]
Model-based software for searching underlying markers
Based on the NLP-based semantic analysis model, we developed a software tool called Artemis’ Eye to assist researchers in easily identifying underlying markers. Additionally, the software labels potential markers that can be detected in body fluids. With Artemis’ Eye, researchers can efficiently locate potential markers and proceed with experiments to validate these findings using the platform designed in our project.
To learn more or to try the software, please visit the Software page.
References
[1] Mayo Clinic. (n.d.). Mayo Clinic. https://www.mayoclinic.org/
[2] GeeksforGeeks. (n.d.). What is a bipartite graph? GeeksforGeeks. https://www.geeksforgeeks.org/what-is-bipartite-graph/
[3] PeptiMap. (n.d.). PeptiMap server. ClusPro. Retrieved September 8, 2024, from https://peptimap.cluspro.org/
[4] Lavi A, Ngan CH, Movshovitz-Attias D, Bohnuud T, Yueh C, Beglov D, Schueler-Furman O, Kozakov D. Detection of peptide-binding sites on protein surfaces: The first step towards the modeling and targeting of peptide-mediated interactions. PROTEINS: Structure, Function, and Bioinformatics. 2013 Oct 17.
[5] Brenke R, Kozakov D, Chuang GY, Beglov D, Hall D, Landon MR, Mattos C, Vajda S. Fragment-based identification of druggable 'hot spots' of proteins using Fourier domain correlation techniques. Bioinformatics. 2009 Mar 1.
[6] RPBS Mobyle. (n.d.). PEP-FOLD4. Mobyle2 RPBS. Retrieved September 8, 2024, from https://mobyle2.rpbs.univ-paris-diderot.fr/cgi-bin/portal.py#forms::PEP-FOLD4
[7] National Chiao Tung University. (n.d.). GEMDOCK server. GEMDOCK. Retrieved September 8, 2024, from http://gemdock.life.nctu.edu.tw/
[8] RCSB PDB. (n.d.). Sequence alignment tool. RCSB Protein Data Bank. Retrieved September 8, 2024, from https://www.rcsb.org/alignment
[9] Small S, Oh TS, Platanias LC. Role of Biomarkers in the Management of Acute Myeloid Leukemia. Int J Mol Sci. 2022 Nov 22;23(23):14543. doi: 10.3390/ijms232314543. PMID: 36498870; PMCID: PMC9741257.
[10] Molica M, Perrone S, Mazzone C, Niscola P, Cesini L, Abruzzese E, de Fabritiis P. CD33 Expression and Gentuzumab Ozogamicin in Acute Myeloid Leukemia: Two Sides of the Same Coin. Cancers (Basel). 2021 Jun 28;13(13):3214. doi: 10.3390/cancers13133214. PMID: 34203180; PMCID: PMC8268215.
[11] Vaikari, V. P., Wu, S., & Alachkar, H. (2018). CD97 is associated with poor overall survival in acute myeloid leukemia. Blood, 132(Supplement 1), 2794. https://doi.org/10.1182/blood-2018-99-120324
[12] Martin, G. H., Desrichard, A., Chung, S. S., Woolthuis, C., Hu, W., Garrett-Bakelman, F. E., Hamann, J., Chan, T., & Park, C. Y. (2016). CD97 is a critical regulator of acute myeloid leukemia stem cell function. Blood, 128(22), 1077. https://doi.org/10.1182/blood.V128.22.1077.1077
[13] Melania Capasso, Lindy G. Durrant, Martin Stacey, Siamon Gordon, Judith Ramage, Ian Spendlove; Costimulation via CD55 on Human CD4+ T Cells Mediated by CD97. J Immunol 15 July 2006; 177 (2): 1070–1077. https://doi.org/10.4049/jimmunol.177.2.1070
[14] Abbott, R. J. M., Spendlove, I., Roversi, P., Fitzgibbon, H., Knott, V., Teriete, P., McDonnell, J. M., Handford, P. A., & Lea, S. M. (2007). Structural and functional characterization of a novel T cell receptor co-regulatory protein complex, CD97-CD55. Journal of Biological Chemistry, 282(30), 22023–22032. https://doi.org/10.1074/jbc.M702588200