Model

Model_ CHELO

Short Description

 This year, our project focuses on developing a model capable of identifying the relationship between diseases and proteins to pinpoint potential biomarkers of diseases. Our approach utilizes a pre-trained Natural Language Processing (NLP) model, which draws on a database comprising notable publications from leading journals such as Nature, Science, and Cell. The model systematically reviews all articles within this database to identify and label significant associations between proteins and diseases.
 Subsequently, we have developed a software tool named "Artemis’ Eye." This application enables users to easily access these potential biomarkers for ongoing research and development. Our tool aims to facilitate advanced studies in disease mechanisms and may assist in the discovery of new therapeutic targets.
 Furthermore, we aim to go beyond merely developing a "search engine" to identify relationships between proteins and diseases. As a result, we have incorporated generative artificial intelligence (generative AI) to design short peptides capable of binding with potential markers. These peptides are intended to be utilized in our detection system for improved functionality.

Flowchart

flowchart
Figure 1. The flowchart for developing our model and creating the software.




Database establishment

Limits to data

 Given the substantial volume of data, the time and space complexity of running the model on a local device becomes prohibitive. To mitigate this issue, we imposed certain constraints. From prior analysis, it was observed that the majority of protein sequences in the database are under 2,000 amino acids (A.A.) in length. Consequently, we excluded any sequences exceeding this length.
 Similarly, for the disease database, we determined that symptoms could not serve as reliable identifiers because many diseases share the same symptom (see the Engineering page for details). We therefore excluded most symptom data from our analysis.
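 The length cutoff described above can be sketched in a few lines. This is a minimal illustration, not our production code: the function name `filter_by_length`, the record layout of (entry ID, sequence) pairs, and the example records are hypothetical, and we assume that a sequence of exactly 2,000 A.A. is kept.

```python
MAX_LENGTH = 2000  # cutoff in amino acids, taken from the text above


def filter_by_length(records, max_length=MAX_LENGTH):
    """Keep only (entry_id, sequence) pairs with at most max_length residues.

    Assumption: a sequence of exactly max_length residues is kept.
    """
    return [(entry_id, seq) for entry_id, seq in records if len(seq) <= max_length]


# Hypothetical records for illustration only.
records = [("P_SHORT", "MKT" * 10), ("P_LONG", "A" * 2500)]
kept = filter_by_length(records)
print([entry_id for entry_id, _ in kept])
```

 Applying the filter before any pairwise processing keeps both memory use and run time bounded by the chosen cutoff.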

Data collection method

1. Database description
 To conduct an analysis of relationships, it is necessary to establish two distinct databases: one for proteins and another for diseases. By linking the data from these two databases, we will create a network to identify and analyze the relationships between them.

2. Protein database establishment
 For the protein database, we collected data from UniProt, focusing specifically on the reviewed entries. This ensures that the information we use has been manually curated by expert biologists, providing a higher level of reliability and accuracy.

3. Disease database establishment
 Regarding the disease database, we reviewed multiple databases that store information on symptoms and disease names, including both vernacular and scientific nomenclature. After compiling this data, we removed duplicate names to ensure the accuracy and integrity of our dataset.
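 The duplicate-removal step for disease names can be sketched as follows. The normalization rule is an assumption made for illustration (names that differ only in letter case or spacing are treated as duplicates); the function name `deduplicate_names` is hypothetical, and matching vernacular names to scientific synonyms is not covered here.

```python
def deduplicate_names(names):
    """Return names in first-seen order, dropping case/whitespace duplicates.

    Assumption: two names are duplicates if they match after collapsing
    whitespace and ignoring letter case.
    """
    seen = set()
    unique = []
    for name in names:
        cleaned = " ".join(name.split())  # collapse repeated whitespace
        key = cleaned.casefold()          # case-insensitive comparison key
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


print(deduplicate_names(["Leukemia", "leukemia ", "Acute  myeloid leukemia", "Edema"]))
```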


Information on the two databases

Protein database

1. Amount
 We gathered 234,484 entries from UniProt and stored them in the database.

2. Length
 Figure 2 shows the length distribution of all entries whose sequences are under 2,000 A.A.

length distribution
Figure 2. The length distribution of the whole protein database.

Disease database

 For the disease database, we gathered information from a variety of online sources, with the majority of the data sourced from the Mayo Clinic.[1] A total of 1,150 diseases were included in the creation of the network used for analysis.


Connection establishment

NLP model database short description

 This pre-trained NLP model is developed using data from two key sources. The first source includes several renowned international journals, such as Nature, Science, and others, which are used to identify relationships between proteins and diseases. The second source consists of statistical analyses of commonly used sentences in everyday language, which aid in evaluating sentence semantics. The journal database includes all papers published in these journals up to January 2024. Additionally, papers available through the NCBI repository are incorporated as part of the data resources.

Limits on the network established between proteins and diseases

 Our primary objective is to identify relationships between diseases and proteins. To alleviate the computational burden posed by two large databases, we structured the relationship network as a bipartite graph. In this structure, one disjoint set represents proteins, while the other corresponds to diseases. This approach reduces computational complexity by eliminating the need to calculate relationships between proteins or between diseases themselves, focusing solely on the connections between the two distinct sets.
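 The bipartite structure described above can be sketched with a small class. This is an illustrative data structure under stated assumptions, not the implementation used in our model: the class name `BipartiteNetwork`, its methods, and the example identifiers and score are all hypothetical. The point it shows is that edges are only ever stored between a protein and a disease, never within one set.

```python
from collections import defaultdict


class BipartiteNetwork:
    """Protein-disease network that only allows edges between the two sets."""

    def __init__(self, proteins, diseases):
        self.proteins = set(proteins)
        self.diseases = set(diseases)
        self.edges = defaultdict(dict)  # protein -> {disease: score}

    def add_relation(self, protein, disease, score):
        # Reject anything that is not a protein-to-disease edge.
        if protein not in self.proteins or disease not in self.diseases:
            raise ValueError("edges must connect a known protein to a known disease")
        self.edges[protein][disease] = score

    def diseases_for(self, protein):
        return dict(self.edges.get(protein, {}))

    def proteins_for(self, disease):
        return {p: links[disease] for p, links in self.edges.items() if disease in links}


# Hypothetical identifiers and score, for illustration only.
net = BipartiteNetwork(proteins=["PROT_A", "PROT_B"], diseases=["disease_alpha"])
net.add_relation("PROT_A", "disease_alpha", 1.5)
print(net.proteins_for("disease_alpha"))
```

 With P proteins and D diseases, at most P × D candidate pairs need to be scored, and no protein-protein or disease-disease pair is ever considered.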

Bipartite graph in figure
Figure 3. The bipartite graph structure of the relationship network. A bipartite graph is defined as a graph in which the vertices can be divided into two disjoint sets, where no two vertices within the same set are connected by an edge. In other words, it is a graph in which every edge connects a vertex of one set to a vertex of the other set.[2]

How this model finds relationships from human language

1. The sentence semantic analysis
 In the NLP model, two key tasks are performed: determining the location of relevant content and analyzing whether the content is positive or negative. The content refers to the sentences in the paper that mention the marker, while the positive or negative classification indicates whether the sentence suggests that the marker contributes to the disease or helps reduce its effects.
 Initially, the model identifies the specific protein in the paper using information such as the entry ID, protein name, and the organism. Once the protein is located, the model analyzes the context of the sentence based on statistical results. For example, in the sentence, "The A protein is overexpressed when the α disease occurs," the model would classify this as positive data, while sentences indicating a reduction of the disease would be classified as negative data. The model also takes into account the arrangement of semantic components, such as Subject, Verb, and Noun, within the sentence. By leveraging these statistical analyses, we can assess the frequency of highly positive or highly negative sentences discussing proteins and diseases, thus identifying strong relationships between specific proteins and diseases.

2. The table analysis
 In addition to sentence semantic analysis, another critical component of formal papers is the data presented in tables. Many papers include percentages in tables to indicate the degree of relationship between variables. In this model, we categorize these percentages into four groups for interpretation: (75%, 100%] is classified as "very highly related", (50%, 75%] as "highly related", (25%, 50%] as "not related", and [0%, 25%] as "very not related". The sentiment (positive or negative) is then determined by analyzing the sentences linked to the table data. By combining these categorizations with the table references, we can perform deeper analyses on the papers within our database.
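 The four percentage groups above translate directly into a small function. This is a minimal sketch: the function name `categorize_percentage` is hypothetical, and the interval boundaries follow the text, so each upper bound belongs to the lower group's neighbor above it only when strictly exceeded (for example, exactly 75% is "highly related").

```python
def categorize_percentage(pct):
    """Map a table percentage (0 to 100) to one of the four groups in the text."""
    if not 0 <= pct <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if pct > 75:
        return "very highly related"   # (75%, 100%]
    if pct > 50:
        return "highly related"        # (50%, 75%]
    if pct > 25:
        return "not related"           # (25%, 50%]
    return "very not related"          # [0%, 25%]


for value in (100, 75, 50, 25, 0):
    print(value, categorize_percentage(value))
```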

3. The scoring method
 The relationship score between diseases and proteins is determined by the frequency of protein mentions in research papers. For each paper, we calculate the number of positive and negative mentions of specific proteins. After reviewing the entire paper, we subtract the number of negative mentions from the positive mentions to obtain the net score for that paper. For example, if a protein is mentioned positively 4 times and negatively 2 times, its final score for that paper is 2. Since the number of papers referencing each protein varies, we calculate an average score to ensure that the final score is not skewed by uneven representation.
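 The scoring rule above can be written out as a short sketch. The function names and the example counts (other than the 4 positive and 2 negative mentions taken from the text) are hypothetical, and we assume each paper is summarized as a (positive, negative) pair of mention counts for one protein.

```python
def paper_net_score(positive, negative):
    """Net score of one paper: positive mentions minus negative mentions."""
    return positive - negative


def average_score(mention_counts):
    """Average the per-paper net scores for one protein.

    mention_counts: list of (positive, negative) pairs, one per paper.
    """
    if not mention_counts:
        raise ValueError("at least one paper is required")
    total = sum(paper_net_score(pos, neg) for pos, neg in mention_counts)
    return total / len(mention_counts)


print(paper_net_score(4, 2))                      # example from the text: 2
print(average_score([(4, 2), (5, 1), (0, 0)]))    # net scores 2, 4, 0 -> 2.0
```

 Averaging over papers means a protein mentioned in many papers does not automatically outrank one that is mentioned in few papers but consistently positively.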

Introduction to the generation tools and the preliminary verification method

 We submitted a protein PDB file to PeptiMap[3], specifying the desired interacting chain. The tool then calculated peptide sequences that would interact with the selected protein chain.
 The PeptiMap process includes:
  ‧ Exploration of the protein surface: Fourier-based mapping docks small probe molecules to the protein surface to locate potential peptide binding sites.[4][5]
  ‧ Refinement: Docked positions are refined to identify clusters representing potential binding sites.[4][5]
  ‧ Clustering and ranking: Probe clusters are ranked by energy scores and peptide binding sites are identified within the top-ranked clusters.[4][5]
  ‧ Site filtering: Internal or buried sites inaccessible to peptides are excluded, focusing only on surface accessible regions.[4][5]
  ‧ Evaluation and validation: PeptiMap has been tested on known peptide-protein complexes, demonstrating high accuracy in identifying peptide binding sites.[4][5]
 After generating the peptide sequences, we used PEP-FOLD 4.0[6] from the RPBS web portal to predict the most likely peptide structures. The optimal structures were selected for further investigation.
 To verify the binding interaction between the peptides and our biomarker, we used iGEMDOCK[7] for docking. The peptide structures and the biomarker structures were used as inputs, resulting in a PDB file of the peptide docking and a table detailing the interactions between specific residues of the peptide and the protein, which facilitated subsequent evaluations.


Relationship analysis of the generated peptides and a real-world protein

 To ensure the validity of the generated peptides, we performed further analysis on their sequences and structures. For sequence validation, we performed a local alignment between the peptides and CD55 sequences using the BLOSUM62 scoring matrix. For structural validation, we used RCSB PDB's Pairwise Structure Alignment tool[8] to analyze the relative positions of the aligned structures.
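 To illustrate what a local alignment computes, here is a minimal Smith-Waterman sketch. It is a simplification and not the procedure we ran: it uses a flat match/mismatch score in place of the BLOSUM62 matrix, a linear gap penalty chosen arbitrarily, and a made-up target string rather than the real CD55 sequence. Only the peptide DWVCSP comes from our results.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b).

    Simplified scoring (assumption): flat match/mismatch values stand in
    for BLOSUM62, and gaps carry a linear penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# "AAVCSPGG" is a made-up target string, not CD55.
print(smith_waterman("DWVCSP", "AAVCSPGG"))
```

 A substitution matrix such as BLOSUM62 replaces the flat `match`/`mismatch` values with a score per amino acid pair, which is what allows conservative substitutions to count as partial matches.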

The top 10 most related diseases in the database

Top 10 mentioned diseases
Figure 4. The top 10 mentioned diseases in the database.

Analysis of the top 10 and the final choice

 From Figure 4, we observe the top 10 most frequently mentioned diseases in the database. However, several entries in the list are not disease names but symptoms, such as dehydration and edema. Additionally, some diseases cover a broad spectrum, making it difficult to identify specific underlying markers. After team discussions, we decided to exclude diseases with these issues and selected leukemia as the final focus for further research.

The underlying marker we found

 After identifying the diseases of interest, we compiled a list of related proteins. Our primary objective is to develop a platform capable of rapid detection. To achieve this, we prioritized markers that can be easily detected from common samples, such as blood or saliva. Ultimately, CD97 emerged as the most highly associated protein, serving as a key underlying marker. Furthermore, we discovered that CD97, when combined with established leukemia markers such as TP53 and RUNX1[9], enhances the precision of leukemia detection.

Venn diagram of CD97 and other markers commonly used for leukemia detection
Figure 5. Venn diagram of CD97 and other markers commonly used for leukemia detection. The diagram shows that the presence of CD97 is associated with three diseases: intrahepatic cholangiocarcinoma, colorectal carcinoma, and our primary target, acute myeloid leukemia (AML). To improve the specificity of detecting AML, we propose combining CD97 with additional markers present in blood or saliva, as shown in the diagram.[9][10][11][12]

The generated short peptides

The sequences of three generated peptides

 First, we generated peptide sequences based on the CD97 structure. The following peptide sequences were generated with PeptiMap:
  ‧ 2BOU-1: VCSPGYEPAKVEQGSYK
  ‧ 2BOU-2: RCNPGFSSSEIII
  ‧ 2BOU-3: DWVCSP

The structures of three peptides generated by PEP-FOLD 4.0
structure of 2BOU-1
Figure 6. The structure of 2BOU-1.
structure of 2BOU-2
Figure 7. The structure of 2BOU-2.
structure of 2BOU-3
Figure 8. The structure of 2BOU-3.

 Due to software limitations, 2BOU-1 could not be docked, but the docking results for 2BOU-2 and 2BOU-3 are provided. In Figures 9 and 10, CD97 is shown in green, the peptides in blue, and the interacting residues of CD97 in dark green.

Figure 9. The docking result of 2BOU-2 and CD97.
The figure shows that there is no collision between 2BOU-2 and CD97.
Figure 10. The docking result of 2BOU-3 and CD97.
The figure shows that there is no collision between 2BOU-3 and CD97.

 According to the iGEMDOCK results, 2BOU-2 interacts with CD97 at Ser-59, Asp-60, Trp-62, Val-71, Cys-72, Ser-73, Pro-74, Gly-75, Tyr-76, Glu-77, Pro-78, Lys-83, Val-96, Gln-103, while 2BOU-3 interacts with residues such as Cys-47, Ala-48, Ser-54, Cys-55, Ser-59, Asp-60, Cys-61, Trp-62. As 2BOU-1 had no docking results, it was excluded from further study.
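 One simple way to compare the two contact lists above is a set intersection. The sketch below only uses the residues listed in the text; since the 2BOU-3 list is introduced with "such as", it may be incomplete, so the result should be read as a lower bound on the shared contacts. The variable and function names are hypothetical.

```python
# CD97 residues contacted by each peptide, copied from the iGEMDOCK results above.
contacts_2bou2 = {
    "Ser-59", "Asp-60", "Trp-62", "Val-71", "Cys-72", "Ser-73", "Pro-74",
    "Gly-75", "Tyr-76", "Glu-77", "Pro-78", "Lys-83", "Val-96", "Gln-103",
}
contacts_2bou3 = {
    "Cys-47", "Ala-48", "Ser-54", "Cys-55", "Ser-59", "Asp-60", "Cys-61", "Trp-62",
}


def residue_number(label):
    """Extract the position from a label such as 'Ser-59'."""
    return int(label.split("-")[1])


# Residues contacted by both peptides, ordered by position.
shared = sorted(contacts_2bou2 & contacts_2bou3, key=residue_number)
print(shared)  # ['Ser-59', 'Asp-60', 'Trp-62']
```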


The relationship of generated short peptides and real-world protein

 Melania Capasso, Lindy G. Durrant, Martin Stacey, and colleagues reported that CD55 binds to CD97.[13] We therefore compared our results with CD55.

 The results of the local alignment are shown in the following figures:
local alignment of 2BOU-2 and CD55
Figure 11. The local alignment result of 2BOU-2 and CD55 with the sequence of 2BOU-2.
The result indicates a high similarity between the two sequences, with four out of six residues matching, suggesting a potential conserved function or structure in these regions.
local alignment of 2BOU-3 and CD55
Figure 12. The local alignment result of 2BOU-3 and CD55 with the sequence of 2BOU-3.
The result shows some similarity between the two sequences, with two exact matches and two conservative substitutions out of six positions, indicating partial conservation between the sequences.

 Here is the result of pairwise structure alignment. 2BOU-3 did not meet the minimum peptide length requirement for alignment, so there is no result for 2BOU-3.

The structure alignment result represented with sequence
Figure 13. The structure alignment result represented with sequence. The figure shows that the two are almost fully aligned.
The structure alignment result represented with structure
Figure 14. The structure alignment result represented with structure.
The orange one is CD55, and the blue one is 2BOU-2. The structure alignment shows a significant level of similarity between the protein and the peptide, with key structural elements aligning closely. This suggests that they may share similar functional properties.

 Furthermore, our research revealed that the binding of CD97 and CD55 is strongly associated with residues 1-4, 48-54, 77-83, 124-127, 130-137 on CD97 and residues 5-12, 40-45, 56-61, 67-74, 78-91, 98-109, 114-125 on CD55. The docking and Pairwise Structure Alignment results for 2BOU-2 show a high degree of overlap with the reported binding sites, increasing confidence in its interaction with CD97.[14]


Model-based software that can search for underlying markers

 Based on the NLP-based semantic analysis model, we developed a software tool called Artemis’ Eye to assist researchers in easily identifying underlying markers. Additionally, the software labels potential markers that can be detected in body fluids. With Artemis’ Eye, researchers can efficiently locate potential markers and proceed with experiments to validate these findings using the platform designed in our project.
 To learn more or to try the software, please visit the Software page.



References

  1. Mayo Clinic. (n.d.). Mayo Clinic. https://www.mayoclinic.org/
  2. GeeksforGeeks. (n.d.). What is a bipartite graph? GeeksforGeeks. https://www.geeksforgeeks.org/what-is-bipartite-graph/
  3. PeptiMap. (n.d.). PeptiMap server. ClusPro. Retrieved September 8, 2024, from https://peptimap.cluspro.org/
  4. Lavi A, Ngan CH, Movshovitz-Attias D, Bohnuud T, Yueh C, Beglov D, Schueler-Furman O, Kozakov D. Detection of peptide-binding sites on protein surfaces: The first step towards the modeling and targeting of peptide-mediated interactions. Proteins: Structure, Function, and Bioinformatics. 2013 Oct 17.
  5. Brenke R, Kozakov D, Chuang GY, Beglov D, Hall D, Landon MR, Mattos C, Vajda S. Fragment-based identification of druggable 'hot spots' of proteins using Fourier domain correlation techniques. Bioinformatics. 2009 Mar 1.
  6. RPBS Mobyle. (n.d.). PEP-FOLD4. Mobyle2 RPBS. Retrieved September 8, 2024, from https://mobyle2.rpbs.univ-paris-diderot.fr/cgi-bin/portal.py#forms::PEP-FOLD4
  7. National Chiao Tung University. (n.d.). GEMDOCK server. GEMDOCK. Retrieved September 8, 2024, from http://gemdock.life.nctu.edu.tw/
  8. RCSB PDB. (n.d.). Sequence alignment tool. RCSB Protein Data Bank. Retrieved September 8, 2024, from https://www.rcsb.org/alignment
  9. Small S, Oh TS, Platanias LC. Role of Biomarkers in the Management of Acute Myeloid Leukemia. Int J Mol Sci. 2022 Nov 22;23(23):14543. doi: 10.3390/ijms232314543. PMID: 36498870; PMCID: PMC9741257.
  10. Molica M, Perrone S, Mazzone C, Niscola P, Cesini L, Abruzzese E, de Fabritiis P. CD33 Expression and Gemtuzumab Ozogamicin in Acute Myeloid Leukemia: Two Sides of the Same Coin. Cancers (Basel). 2021 Jun 28;13(13):3214. doi: 10.3390/cancers13133214. PMID: 34203180; PMCID: PMC8268215.
  11. Vaikari, V. P., Wu, S., & Alachkar, H. (2018). CD97 is associated with poor overall survival in acute myeloid leukemia. Blood, 132(Supplement 1), 2794. https://doi.org/10.1182/blood-2018-99-120324
  12. Martin, G. H., Desrichard, A., Chung, S. S., Woolthuis, C., Hu, W., Garrett-Bakelman, F. E., Hamann, J., Chan, T., & Park, C. Y. (2016). CD97 is a critical regulator of acute myeloid leukemia stem cell function. Blood, 128(22), 1077. https://doi.org/10.1182/blood.V128.22.1077.1077
  13. Melania Capasso, Lindy G. Durrant, Martin Stacey, Siamon Gordon, Judith Ramage, Ian Spendlove; Costimulation via CD55 on Human CD4+ T Cells Mediated by CD97. J Immunol 15 July 2006; 177 (2): 1070–1077. https://doi.org/10.4049/jimmunol.177.2.1070
  14. Abbott, R. J. M., Spendlove, I., Roversi, P., Fitzgibbon, H., Knott, V., Teriete, P., McDonnell, J. M., Handford, P. A., & Lea, S. M. (2007). Structural and functional characterization of a novel T cell receptor co-regulatory protein complex, CD97-CD55. Journal of Biological Chemistry, 282(30), 22023–22032. https://doi.org/10.1074/jbc.M702588200