Model

Database establishment

Limits to data

Given the substantial volume of data, the time and space complexity of running the model on a local device becomes prohibitive. To mitigate this issue, we imposed certain constraints. From prior analysis, it was observed that the majority of protein sequences in the database are under 2,000 amino acids (A.A.) in length. Consequently, we excluded any sequences exceeding this length.
Similarly, for the disease database, we determined that symptoms could not serve as reliable identifiers, because a single symptom is often associated with a large number of diseases (see the Engineering page for details). Therefore, we excluded most symptom data from our analysis.
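The length constraint described above can be sketched as a simple filter. This is a minimal illustration, not the project's actual pipeline: the record layout (pairs of entry ID and sequence), the function name, and the sample entries are assumptions, and the sketch keeps sequences of exactly 2,000 residues because the text excludes only sequences exceeding that length.

```python
MAX_LENGTH = 2000  # cutoff stated in the text, in amino acids


def filter_by_length(records, max_length=MAX_LENGTH):
    """Keep (entry_id, sequence) pairs whose sequence has at most max_length residues."""
    return [(entry_id, seq) for entry_id, seq in records if len(seq) <= max_length]


# Hypothetical records used only to exercise the filter.
records = [
    ("P_SHORT", "MKT" * 10),   # 30 residues, kept
    ("P_EDGE", "A" * 2000),    # exactly 2,000 residues, kept
    ("P_LONG", "A" * 2001),    # 2,001 residues, dropped
]

kept = filter_by_length(records)
print([entry_id for entry_id, _ in kept])
```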

Data collection method

1. Database description
To conduct an analysis of relationships, it is necessary to establish two distinct databases: one for proteins and another for diseases. By linking the data from these two databases, we will create a network to identify and analyze the relationships between them.

2. Protein database establishment
For the protein database, we collected data from UniProt, focusing specifically on reviewed entries. This ensures that the information we used has been manually curated by expert biologists, providing a higher level of reliability and accuracy.

3. Disease database establishment
For the disease database, we reviewed multiple databases that store information on symptoms and disease names, including both vernacular and scientific nomenclature. After compiling this data, we removed duplicate names to ensure the accuracy and integrity of our dataset.
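Duplicate removal of this kind can be sketched as follows. The normalization rule (lowercasing and collapsing whitespace) and the sample names are assumptions for illustration. The sketch does not merge vernacular and scientific synonyms of the same disease, which would require a curated mapping.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def deduplicate(names):
    """Return names in first-seen order, dropping later normalized duplicates."""
    seen = set()
    unique = []
    for name in names:
        key = normalize(name)
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique


# Hypothetical input compiled from several sources.
names = ["Leukemia", "leukemia ", "Acute  myeloid leukemia", "Edema", "acute myeloid leukemia"]
unique_names = deduplicate(names)
print(unique_names)
```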

Information on the two databases

Protein database

1. Amount
We gathered 234,484 entries from UniProt and stored them in the database.

2. Length
The figure below shows the length distribution of all entries whose sequences are under 2,000 A.A.

Figure 2. The length distribution of the whole protein database.

Disease database

For the disease database, we gathered information from a variety of online sources, with the majority of the data sourced from the Mayo Clinic.^[1] A total of 1,150 diseases were included in the creation of the network used for analysis.

Connection establishment

Short description of the NLP model database

The pre-trained NLP model was developed using data from two key sources. The first source comprises several renowned international journals, such as Nature and Science, which are used to identify relationships between proteins and diseases. The second source consists of statistical analyses of commonly used sentences in everyday language, which aid in evaluating sentence semantics. The journal database includes all papers published in these journals up to January 2024. Additionally, papers available through the NCBI repository are incorporated as part of the data resources.

Limits on the network established between proteins and diseases

Our primary objective is to identify relationships between diseases and proteins. To alleviate the computational burden posed by two large databases, we structured the relationship network as a bipartite graph. In this structure, one disjoint set represents proteins, while the other corresponds to diseases. This approach reduces computational complexity by eliminating the need to calculate relationships between proteins or between diseases themselves, focusing solely on the connections between the two distinct sets.

Figure 3. The bipartite graph structure of the relationship network. A bipartite graph is a graph whose vertices can be divided into two disjoint sets such that no two vertices within the same set are connected by an edge. In other words, every edge connects a vertex of one set to a vertex of the other set.^[2]
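A bipartite network of this kind can be sketched with two adjacency maps, one per vertex set. The class name, method names, and the placeholder protein and disease labels are assumptions for illustration. Because edges can only be added between a protein and a disease, protein-protein and disease-disease relationships are never stored or computed.

```python
from collections import defaultdict


class BipartiteNetwork:
    """Protein-disease network that only stores edges between the two sets."""

    def __init__(self):
        self.protein_to_diseases = defaultdict(set)
        self.disease_to_proteins = defaultdict(set)

    def add_edge(self, protein, disease):
        """Link one protein to one disease; no other edge type exists."""
        self.protein_to_diseases[protein].add(disease)
        self.disease_to_proteins[disease].add(protein)

    def proteins_for(self, disease):
        """Proteins linked to a disease, sorted for stable output."""
        return sorted(self.disease_to_proteins.get(disease, set()))

    def diseases_for(self, protein):
        """Diseases linked to a protein, sorted for stable output."""
        return sorted(self.protein_to_diseases.get(protein, set()))


# Hypothetical placeholder vertices.
network = BipartiteNetwork()
network.add_edge("PROT_A", "disease_alpha")
network.add_edge("PROT_A", "disease_beta")
network.add_edge("PROT_B", "disease_alpha")

print(network.proteins_for("disease_alpha"))
print(network.diseases_for("PROT_A"))
```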

How the model finds relationships in human language

1. The sentence semantic analysis
In the NLP model, two key tasks are performed: determining the location of relevant content and analyzing whether the content is positive or negative. The content refers to the sentences in the paper that mention the marker, while the positive or negative classification indicates whether the sentence suggests that the marker contributes to the disease or helps reduce its effects.
Initially, the model identifies the specific protein in the paper using information such as the entry ID, protein name, and the organism. Once the protein is located, the model analyzes the context of the sentence based on statistical results. For example, in the sentence, "The A protein is overexpressed when the α disease occurs," the model would classify this as positive data, while sentences indicating a reduction of the disease would be classified as negative data. The model also takes into account the arrangement of semantic components, such as Subject, Verb, and Noun, within the sentence. By leveraging these statistical analyses, we can assess the frequency of highly positive or highly negative sentences discussing proteins and diseases, thus identifying strong relationships between specific proteins and diseases.
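The two tasks above (locating sentences that mention a marker, then labeling them) can be illustrated with a deliberately simplified sketch. It is not the project's model: the cue-word lists, the function names, and the naive sentence splitter are all assumptions, and the statistical analysis of sentence structure described above is replaced here by plain keyword matching.

```python
import re

# Hypothetical cue words; the real model relies on statistical analysis instead.
POSITIVE_CUES = {"overexpressed", "upregulated", "promotes"}
NEGATIVE_CUES = {"reduces", "suppresses", "inhibits"}


def split_sentences(text):
    """Naive splitter on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(sentence, aliases):
    """True if any alias appears as a whole word (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(alias) + r"\b", sentence, re.IGNORECASE)
        for alias in aliases
    )


def classify(sentence):
    """Label a sentence by cue words; positive cues take precedence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if words & POSITIVE_CUES:
        return "positive"
    if words & NEGATIVE_CUES:
        return "negative"
    return "neutral"


text = (
    "The A protein is overexpressed when the α disease occurs. "
    "The B protein was not measured. "
    "The A protein reduces the severity of the α disease."
)
aliases = ["A protein"]  # stands in for entry ID, protein name, and so on

hits = [(s, classify(s)) for s in split_sentences(text) if mentions(s, aliases)]
for sentence, label in hits:
    print(label, "|", sentence)
```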

2. The table analysis
In addition to sentence semantic analysis, another critical component of formal papers is the data presented in tables. Many papers include percentages in tables to indicate the degree of relationship between variables. In this model, we categorize these percentages into four groups for interpretation: (75%, 100%] is classified as "very highly related", (50%, 75%] as "highly related", (25%, 50%] as "not related", and [0%, 25%] as "very not related". Additionally, the sentiment (positive or negative) is determined by analyzing the sentences linked to the table data. By combining these categorizations with the table references, we can perform deeper analyses on the papers within our database.
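The four intervals can be written as a small lookup function. The function name and the error handling for out-of-range values are assumptions; the interval boundaries and labels follow the text, with each upper bound inclusive.

```python
def categorize(percentage):
    """Map a table percentage in [0, 100] to one of the four relationship groups."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percentage > 75:
        return "very highly related"   # (75%, 100%]
    if percentage > 50:
        return "highly related"        # (50%, 75%]
    if percentage > 25:
        return "not related"           # (25%, 50%]
    return "very not related"          # [0%, 25%]


for value in (0, 25, 50, 75, 100):
    print(value, categorize(value))
```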

3. The scoring method
The relationship score between diseases and proteins is determined by the frequency of protein mentions in research papers. For each paper, we calculate the number of positive and negative mentions of specific proteins. After reviewing the entire paper, we subtract the number of negative mentions from the positive mentions to obtain the net score for that paper. For example, if a protein is mentioned positively 4 times and negatively 2 times, its final score for that paper is 2. Since the number of papers referencing each protein varies, we calculate an average score to ensure that the final score is not skewed by uneven representation.
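The scoring rule can be sketched in a few lines. The function names and the input layout (one pair of positive and negative mention counts per paper) are assumptions; the arithmetic follows the description above, with a net score per paper averaged over all papers that mention the protein.

```python
def paper_score(positive, negative):
    """Net score for one paper: positive mentions minus negative mentions."""
    return positive - negative


def average_score(mention_counts):
    """Average net score over papers; mention_counts holds (positive, negative) pairs."""
    if not mention_counts:
        raise ValueError("at least one paper is required")
    scores = [paper_score(pos, neg) for pos, neg in mention_counts]
    return sum(scores) / len(scores)


# The worked example from the text: 4 positive and 2 negative mentions.
print(paper_score(4, 2))

# Hypothetical counts for three papers mentioning the same protein.
print(average_score([(4, 2), (1, 3), (5, 0)]))
```

Averaging keeps a protein discussed in many papers from outscoring a rarely discussed one purely because of publication volume.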

Introduction to the generation tools and the preliminary verification method

We supplied a protein PDB file to PeptiMap^[3] and specified the desired interacting chain. The tool then calculated peptide sequences that would interact with the selected protein chain.
The PeptiMap process includes:
‧ Exploration of the protein surface: Fourier-based mapping docks small probe molecules to the protein surface to locate potential peptide binding sites.^[4][5]
‧ Refinement: Docked positions are refined to identify clusters representing potential binding sites.^[4][5]
‧ Clustering and ranking: Probe clusters are ranked by energy scores and peptide binding sites are identified within the top-ranked clusters.^[4][5]
‧ Site filtering: Internal or buried sites inaccessible to peptides are excluded, focusing only on surface accessible regions.^[4][5]
‧ Evaluation and validation: PeptiMap has been tested on known peptide-protein complexes, demonstrating high accuracy in identifying peptide binding sites.^[4][5]
After generating the peptide sequences, we used PEP-FOLD 4.0^[6] from the RPBS web portal to predict the most likely peptide structures. The optimal structures were selected for further investigation.
To verify the binding interaction between the peptides and our biomarker, we used iGEMDOCK^[7] for docking. The peptide structures and the biomarker structures were used as inputs, resulting in a PDB file of the peptide docking and a table detailing the interactions between specific residues of the peptide and the protein, which facilitated subsequent evaluations.

Relationship analysis of the generated peptides and a real-world protein

To ensure the validity of the generated peptides, we performed further analysis on their sequences and structures. For sequence validation, we performed a local alignment between the peptides and CD55 sequences using the BLOSUM62 scoring matrix. For structural validation, we used RCSB PDB's Pairwise Structure Alignment tool^[8] to analyze the relative positions of the aligned structures.
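Local alignment of this kind is typically computed with the Smith-Waterman algorithm, sketched below. To keep the example self-contained, it swaps BLOSUM62 for a toy match/mismatch scheme with a linear gap penalty, so its scores are not comparable to the project's results; the function name and scoring parameters are assumptions. Because the CD55 sequence is not reproduced on this page, the sketch aligns two of the generated peptides with each other instead.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Uses a toy match/mismatch score in place of a substitution matrix
    such as BLOSUM62, and a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two of the generated peptide sequences listed on this page.
score, top, bottom = smith_waterman("DWVCSP", "VCSPGYEPAKVEQGSYK")
print(score)
print(top)
print(bottom)
```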

The top 10 most highly related diseases in the database

Figure 4. The top 10 mentioned diseases in the database.

Analysis of the top 10 and the final choice

From Figure 4, we observe the top 10 most frequently mentioned diseases in the database. However, several entries in the list are not disease names but symptoms, such as dehydration and edema. Additionally, some diseases cover a broad spectrum, making it difficult to identify specific underlying markers. After team discussions, we decided to exclude diseases with these issues and selected leukemia as the final focus for further research.

The underlying marker we found

After identifying the diseases of interest, we compiled a list of related proteins. Our primary objective is to develop a platform capable of rapid detection. To achieve this, we prioritized markers that can be easily detected from common samples, such as blood or saliva. Ultimately, CD97 emerged as the most highly associated protein, serving as a key underlying marker. Furthermore, we discovered that CD97, when combined with established leukemia markers such as TP53 and RUNX1^[9], enhances the precision of leukemia detection.

Figure 5. Venn diagram of CD97 and other markers commonly used for leukemia detection. The diagram shows that the presence of CD97 is associated with three diseases: intrahepatic cholangiocarcinoma, colorectal carcinoma, and our primary target, acute myeloid leukemia (AML). To improve the specificity of detecting AML, we propose combining CD97 with additional markers present in blood or saliva, as shown in the diagram.^[9][10][11][12]

The generated short peptides

The sequences of three generated peptides

First, we generated peptide sequences based on the CD97 structure. The following peptide sequences were generated with PeptiMap:
‧ 2BOU-1: VCSPGYEPAKVEQGSYK
‧ 2BOU-2: RCNPGFSSSEIII
‧ 2BOU-3: DWVCSP

The structures of three peptides generated by PEP-FOLD 4.0

Figure 6. The structure of 2BOU-1.

Figure 7. The structure of 2BOU-2.

Figure 8. The structure of 2BOU-3.

Due to software limitations, 2BOU-1 could not be docked, but the docking results for 2BOU-2 and 2BOU-3 are provided. In Figures 9 and 10, CD97 is shown in green, the peptides in blue, and the interacting residues of CD97 in dark green.

Figure 9. The docking result of 2BOU-2 and CD97.
The figure shows that there is no collision between 2BOU-2 and CD97.

Figure 10. The docking result of 2BOU-3 and CD97.
The figure shows that there is no collision between 2BOU-3 and CD97.

According to the iGEMDOCK results, 2BOU-2 interacts with CD97 at Ser-59, Asp-60, Trp-62, Val-71, Cys-72, Ser-73, Pro-74, Gly-75, Tyr-76, Glu-77, Pro-78, Lys-83, Val-96, Gln-103, while 2BOU-3 interacts with residues such as Cys-47, Ala-48, Ser-54, Cys-55, Ser-59, Asp-60, Cys-61, Trp-62. As 2BOU-1 had no docking results, it was excluded from further study.

Relationship between the generated short peptides and a real-world protein

Melania Capasso, Lindy G. Durrant, and Martin Stacey reported that CD55 binds to CD97.^[13] We therefore used CD55 as a reference for comparison with our results.

The results of the local alignment are shown in the following figures:

The result of the pairwise structure alignment is shown below. 2BOU-3 did not meet the minimum peptide length required for alignment, so no result is available for it.

Figure 13. The structure alignment result represented as sequences. The figure shows that the two sequences are almost fully aligned.

Figure 14. The structure alignment result represented with structure.
CD55 is shown in orange and 2BOU-2 in blue. The structure alignment shows a significant level of similarity between the protein and the peptide, with key structural elements aligning closely. This suggests that they may share similar functional properties.

Furthermore, our research revealed that the binding of CD97 and CD55 is strongly associated with residues 1-4, 48-54, 77-83, 124-127, 130-137 on CD97 and residues 5-12, 40-45, 56-61, 67-74, 78-91, 98-109, 114-125 on CD55. The docking and Pairwise Structure Alignment results for 2BOU-2 show a high degree of overlap with the reported binding sites, increasing confidence in its interaction with CD97.^[14]

Model-based software for searching underlying markers

Based on the NLP-based semantic analysis model, we developed a software tool called Artemis’ Eye to assist researchers in easily identifying underlying markers. Additionally, the software labels potential markers that can be detected in body fluids. With Artemis’ Eye, researchers can efficiently locate potential markers and proceed with experiments to validate these findings using the platform designed in our project.
To learn more or to try the software, please visit the Software page.
