General Introduction
Endocrine Disrupting Chemicals (EDCs) are natural or synthetic compounds that can mimic, block, or interfere with the human body’s hormones, disrupting the endocrine system. With more than 1,000 identified EDCs and many more chemicals showing potential endocrine-disrupting activity, these compounds interact with a wide range of hormonal targets (receptors, hormones, etc.). However, the strength and specificity of these interactions can vary significantly [1]. To obtain specific information about EDC/target interactions and conduct fundamental research, scientists must undergo the tedious process of reading numerous scientific articles.
Researchers currently use several methodologies for this purpose, the most common being traditional literature parsing: manually searching vast databases such as PubMed, cross-referencing the results of studies by reading even more articles, and then identifying potential leads for their project. This method, although thorough, is very time-consuming and can introduce selection bias on the part of the researcher. In addition, conflicts between experimental findings and in silico simulations create confusion. Hence, parsing and validating a particular segment of the literature can slow a scientific team's progress.
This highlights the need to automate the scientific literature parsing process. A major breakthrough came with the launch of ChatGPT by OpenAI in November 2022, and since then AI models and GPTs have been used extensively. Current AI tools offer a faster and cheaper way to access scientific information. One of the technologies underlying some of these AI models is Natural Language Processing (NLP). NLP is a subfield of Machine Learning and Artificial Intelligence that is increasingly recognized as a transformative tool by the scientific community [2]. NLP models are highly modular, with several general NLP techniques and approaches already being applied to synthetic biology. The flexibility NLP models offer allows the development of custom models tailored to the specifics of each field.
SENTINEL
Our team’s mission is to use NLP technology to automate the scientific literature review process for every iGEM team. Our idea was inspired by our own struggle to find relevant literature in the early stages of a project (brainstorming, early planning, etc.), which are the most critical for a successful iGEM project: the earlier a team decides on a project, the earlier it can begin lab work. We believe that many iGEM teams have faced the same problem, and we aim to fix it. That is why we, the DTU BioBuilders, developed SENTINEL (Sentence Extracting Networked Target Information using NLP and Exploration of Literature), a user-friendly NLP database developed to identify interactions between compounds and their targets, and the type of each interaction, as they are mentioned in the scientific literature.
For our proof of concept, and to move the overall project forward, we developed two custom Named Entity Recognition (NER) models.
The data we used to train and develop our NER models were experimental articles retrieved from the PubMed database through the Entrez API, using a custom query that combined EDC and target names as found in the DeDUCT (Database of Endocrine Disrupting Chemicals and their Toxicity Profiles) [3], EDKB (Endocrine Disruptor Knowledge Base), and ChEMBL databases, respectively.
We conducted a bipartite network analysis using the Cytoscape API. The network consists of two distinct sets of nodes (chemicals and targets), with connections only between sets. This analysis helps users identify which EDCs affect multiple targets, indicating broader biological impacts, and highlights molecular targets influenced by multiple EDCs, useful for biosensor development. Edge lengths reflect interaction frequency, with shorter edges indicating stronger associations, allowing users to assess the robustness of these relationships based on existing literature.
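The degree statistics behind this bipartite analysis can be sketched in plain Python. The edge list below is hypothetical illustration data, not our actual results; the real network is built from the extracted interactions and rendered via the Cytoscape API.

```python
from collections import Counter

# Hypothetical EDC-target interaction edges: (chemical, target, mention_count)
edges = [
    ("Bisphenol A", "Estrogen receptor alpha", 12),
    ("Bisphenol A", "Androgen receptor", 4),
    ("DDT", "Estrogen receptor alpha", 7),
    ("Phthalate", "Thyroid hormone receptor", 2),
]

# Degree of each EDC node: how many distinct targets it affects
edc_degree = Counter(chem for chem, _, _ in edges)
# Degree of each target node: how many distinct EDCs influence it
target_degree = Counter(target for _, target, _ in edges)

# Targets hit by multiple EDCs are promising biosensor candidates
biosensor_candidates = [t for t, d in target_degree.items() if d > 1]
print(edc_degree["Bisphenol A"])  # 2
print(biosensor_candidates)       # ['Estrogen receptor alpha']
```

The mention counts on the edges are what drive the edge-length encoding in the visualization: more co-mentions means a shorter, stronger edge.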
Technical information
The NER models that SENTINEL is built upon come from spaCy and scispaCy and are the following:
- en_ner_bionlp13cg_md: A pretrained scispaCy model for biomedical entity recognition.
- en_ner_bc5cdr_md: A pretrained scispaCy model, particularly designed to recognize chemical and disease entities.
- Custom NER models: We also employed our custom-trained models that focus on identifying EDCs, molecular targets, and biological activity tokens.
The hyperparameters used for our custom models were:
- learning rate: 0.001
- epochs: 5600
- batch_size: 1000
All NER models were enhanced with the scispaCy AbbreviationDetector, which captures abbreviations and their corresponding long forms, and with spaCy’s Sentencizer, which detects sentence boundaries in the text.
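To make the abbreviation step concrete without requiring scispaCy, here is a heavily simplified stdlib sketch of the long-form/short-form pairing that the AbbreviationDetector performs (it implements the Schwartz-Hearst algorithm; this toy version only matches initials of the preceding words):

```python
import re

def find_abbreviations(text):
    """Toy sketch of long-form/short-form pairing: match a parenthesized
    all-caps short form and accept the preceding words whose initials
    spell it out. scispaCy's real detector is far more robust."""
    pairs = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        short = match.group(1)
        words = text[: match.start()].rstrip().split()
        # Take as many preceding words as there are letters in the short form
        candidate = words[-len(short):]
        initials = "".join(w[0].upper() for w in candidate)
        if initials == short:
            pairs[short] = " ".join(candidate)
    return pairs

text = "Endocrine Disrupting Chemicals (EDC) bind the Estrogen Receptor (ER)."
print(find_abbreviations(text))
```

In the actual pipeline this is a single `nlp.add_pipe("abbreviation_detector")` call on the loaded scispaCy model; the sketch just shows what that component contributes.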
Our NER pipeline successfully identified a substantial number of specific entity mentions in the literature. The EDC-Target interaction specific entities that we extracted are:
- EDCs: 26,300 entities
- Activity: 28,023 entities
- Molecular Target: 4,198 entities
Table 1 below provides a comparison between Traditional database searching, ChatGPT, and SENTINEL.
| Feature | Traditional Database Searching | ChatGPT | SENTINEL |
| --- | --- | --- | --- |
| Precision | Moderate | Moderate | High |
| Speed | Slow | Fast | Fast |
| Information Structure | Unstructured | Unstructured | Structured |
| Cross-referencing of Data | Manual | No | Automated |
| Adaptability to Specific Needs | Low | Moderate | High |
| Risk of False Information | Low | High | Low |
Contribution
SENTINEL provides a faster and more comprehensive way to mine scientific literature for insightful information. The overall value of our project is outlined below:
1. Enhanced Data Accessibility
Our database provides a centralized resource where iGEM teams can easily access and query information about EDCs, their targets, and how those targets are affected.
SENTINEL combines information from different studies into a single database, allowing users to view connections and gather insights from various sources.
2. Informed Decision-Making
Using SENTINEL, users can quickly identify which EDCs target specific biological pathways or proteins of interest, facilitating more informed decision-making in their experimental designs.
3. Visualization and Analysis
By visualizing the relationships between EDCs and targets in a network format, users can identify complex interactions and patterns not immediately apparent from tabular data.
4. Collaborative Research
SENTINEL can be shared with other teams and researchers, promoting collaboration and the sharing of insights and findings.
It provides a standardized way to represent and analyze data, facilitating collaboration across different iGEM teams.
To make SENTINEL as accessible and user-friendly as possible, we packaged the entire pipeline in a Docker container that contains all the necessary dependencies and libraries required to launch the database, eliminating the often challenging task of managing and installing multiple dependencies on various operating systems.
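A Dockerfile for such a setup could look like the sketch below. The file names and port are hypothetical placeholders; the actual entry point and requirements live in our repository.

```dockerfile
# Hypothetical sketch -- actual file and entry-point names in our repo may differ
FROM python:3.10-slim
WORKDIR /app
# Install pinned Python dependencies first so Docker caches this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of the pipeline and launch the database interface
COPY . .
EXPOSE 8050
CMD ["python", "app.py"]
```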
Software Tool Success
Our software tool, SENTINEL, has already proven to be a valuable asset in the research and development of EDCs. One of the first key collaborators to utilize SENTINEL is the Team MilkClear UCPH iGEM from the University of Copenhagen, as part of the 2024 iGEM competition.
MilkClear leveraged SENTINEL to create a specialized catalog of receptors that can be integrated into biosensors for detecting EDCs. Their use of SENTINEL streamlined the identification process and empowered their project with a robust database of receptor options. This collaboration was strengthened when we met up at the NiC Conference in Turku, Finland, where we shared insights and discussed future innovations.
Interested in using our database? Visit our GitLab and start exploring all its functions.
Instructions
SENTINEL Manual
Our team made its best effort to develop a user-friendly interface that can be used by every iGEM-er without compromising the efficiency of our software. To use SENTINEL, the user first has to clone the dtu-biobuilders repository and launch our database locally. The user is then introduced to SENTINEL’s homepage (Figure 3).
Step 1. Type a query (e.g. a molecular target) (Figure 4)
Step 2. Press Enter
The results are a Cytoscape network visualization, generated by our network analysis, with the queried molecular target at the center of the network, and a table (Figure 5, Figure 6, Figure 7) with the necessary information (EDC, Target, Activity, Articles, Details). The 'Details' column is collapsible and allows the user to view the actual sentence as it is found in the text (Figure 7).
Data Mining & Data Preparation
To build SENTINEL we developed two pipelines. The first pipeline outlines the essential steps to transform unstructured text data into a structured and standardized format suitable for developing NER models. This processed data is then used to create ground truth data and build the NER model. Custom NER models are necessary because pretrained models, though highly powerful, may not fully meet the specific requirements of a project. The second pipeline was designed to build, train, validate, and select the most optimal NER model. We developed and share these pipelines to help every future iGEM-er build their own custom NER models from scratch.
This is our recommended pipeline to mine, structure and preprocess the text data (Figure 8).
Mine & transform the data for spaCy model
Step 1. Query construction: Search for relevant literature on PubMed using custom queries that combine the chemical and target names of interest.
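A minimal sketch of this step, with hypothetical hard-coded name lists (in our pipeline the terms come from DeDUCT, EDKB, and ChEMBL rather than being written by hand):

```python
# Hypothetical name lists -- in the real pipeline these are loaded
# from the DeDUCT, EDKB, and ChEMBL databases.
edcs = ["bisphenol A", "diethylstilbestrol"]
targets = ["estrogen receptor", "androgen receptor"]

def build_query(edcs, targets):
    """Combine EDC and target names into a single PubMed boolean query."""
    edc_clause = " OR ".join(f'"{name}"[Title/Abstract]' for name in edcs)
    target_clause = " OR ".join(f'"{name}"[Title/Abstract]' for name in targets)
    return f"({edc_clause}) AND ({target_clause})"

query = build_query(edcs, targets)
print(query)
```

The resulting string can then be submitted to NCBI's ESearch endpoint, e.g. via Biopython's `Entrez.esearch(db="pubmed", term=query)`.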
Step 2. XML retrieval and scraping: After performing the search, we retrieved article metadata and full-text records in XML format. We parsed this XML data using the BeautifulSoup package in Python. This allowed us to extract and organize the relevant information into a metadata CSV file, which included:
- PMID: The PubMed ID for reference.
- Authors: The list of authors who contributed to the study.
- Title: The title of the article.
- Publication Date: The year of publication.
- PMC ID: If available, the PubMed Central ID to retrieve full text.
- DOI: The article's Digital Object Identifier.
- Section: The specific section of the article.
- Article Text: The text content of the article.
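The parsing step above can be sketched with the standard library. Our pipeline used BeautifulSoup, but `xml.etree` works the same way on the efetch XML; the inline record below is a minimal stand-in (real PubMed records carry many more fields, including authors, dates, and DOIs):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Minimal stand-in for a PubMed efetch XML record (real records are richer)
xml_data = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345678</PMID>
      <Article>
        <ArticleTitle>Bisphenol A and the estrogen receptor</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

root = ET.fromstring(xml_data)
rows = []
for article in root.iter("PubmedArticle"):
    rows.append({
        "PMID": article.findtext(".//PMID"),
        "Title": article.findtext(".//ArticleTitle"),
    })

# Write the metadata to CSV (in-memory here; a file path in practice)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["PMID", "Title"])
writer.writeheader()
writer.writerows(rows)
print(rows[0]["PMID"])  # 12345678
```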
Step 3. Text cleaning and Tokenization: In this step, we clean the text by removing unwanted characters, such as quotes, brackets, and special symbols, which could interfere with the model.
Once the text is cleaned, we tokenize the sentences using Python’s Natural Language Toolkit (NLTK). Tokenization allows us to work with manageable pieces of data for further analysis, ensuring each sentence can be processed independently.
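The cleaning and splitting steps can be sketched as below. The regex splitter is a naive stand-in for NLTK's `sent_tokenize`, which is what the pipeline actually uses, and the character class is an illustrative choice, not our exact cleaning rules:

```python
import re

def clean_text(text):
    """Strip quotes, brackets, and similar symbols that could confuse the
    model (a simplified version of our cleaning step), then collapse spaces."""
    text = re.sub(r"[\"'\[\]{}()<>]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text):
    """Naive sentence splitter standing in for NLTK's sent_tokenize:
    split after terminal punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

raw = 'BPA ("bisphenol A") binds ER. It disrupts signaling!'
sentences = split_sentences(clean_text(raw))
print(sentences)  # ['BPA bisphenol A binds ER.', 'It disrupts signaling!']
```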
Custom NER Model Development
This is our recommended pipeline to develop a custom NER spaCy model (Figure 9).
Build your Own spaCy NER Model
Step 1. Load the data: Load the full dataset of processed articles created by the first pipeline and randomly select 15% of them. The randomization ensures diverse coverage of different entities. Save the selected articles in a separate txt file for manual annotation.
Step 2. Manual Annotation: This step is crucial for creating high-quality training data. Human experts manually label relevant entities (e.g., chemicals, receptors, diseases, proteins) in 15% of the dataset to build our ground truth dataset. In our project, the DTU BioBuilders dry lab team members served as the experts. The annotations follow a structured format compatible with spaCy for model training. To perform the manual annotation, we used the NER text annotator provided by spaCy developers, NER Annotator.
Step 3. Train the spaCy models: Once the 15% of the dataset is manually annotated, the annotations are converted into a JSON format compatible with spaCy. Now it is time to train the model! The training process fine-tunes the model’s weights so that it accurately predicts named entities based on the labeled data.
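The annotation format in question is a list of (text, entity-spans) pairs with character offsets. The sentence and label names below are illustrative examples matching our three entity categories, not real annotations from our dataset:

```python
import json

# One hypothetical annotated sentence in the (text, {"entities": ...})
# format exported by the NER Annotator and accepted by spaCy's converters
sentence = "Bisphenol A inhibits the estrogen receptor."
annotation = [
    sentence,
    {"entities": [
        [0, 11, "EDC"],                 # "Bisphenol A"
        [12, 20, "ACTIVITY"],           # "inhibits"
        [25, 42, "MOLECULAR_TARGET"],   # "estrogen receptor"
    ]},
]

# Sanity-check that each character span really covers the entity text;
# off-by-one offsets are the most common annotation-conversion bug
for start, end, label in annotation[1]["entities"]:
    print(label, repr(sentence[start:end]))

print(json.dumps(annotation))
```

Running a check like this over the converted JSON before training catches misaligned spans early, since spaCy silently drops entities whose offsets do not align with token boundaries.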
Step 4. Model Evaluation: During training, spaCy tracks the model’s performance using several key metrics. After each training run, we evaluate the models based on the following:
- LOSS_TOK2VEC: The loss function for token vectors (how well the model represents words).
- LOSS_NER: The loss function for named entity recognition (how well the model identifies entities).
- ENTS_F: F1 score for entity recognition, balancing precision and recall.
- ENTS_P: Precision score, measuring how many of the model’s predicted entities are correct.
- ENTS_R: Recall score, measuring how many of the true entities were correctly predicted.
- SCORE: Overall evaluation score.
These metrics guide us to select the model that best balances identifying entities correctly and comprehensively.
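The relationship between the three entity-level metrics is simple to state in code; ENTS_F is the harmonic mean of ENTS_P and ENTS_R (the precision/recall values below are hypothetical):

```python
def f1(precision, recall):
    """ENTS_F as spaCy reports it: harmonic mean of ENTS_P and ENTS_R."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical run: 90% of predicted entities correct, 80% of true entities found
print(round(f1(0.90, 0.80), 4))  # 0.8471
```

Because the harmonic mean punishes imbalance, a model that over-predicts entities (high recall, low precision) scores noticeably worse on ENTS_F than its average of P and R would suggest.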
Step 5. Model selection: Pick the model with the best overall metrics.
Well Done! You have just created your custom NER model.
Limitations & Future Outlook
Although our tool provides useful information, it has limitations. More specifically, SENTINEL is not yet able to capture all interactions between EDCs and their molecular targets found in the literature, due to the limited size and quality of the data. In addition, the relatively small number of sentences retrieved may result from our model requiring EDC and molecular target mentions to appear in the same sentence. This highlights the need for more advanced models that can perform NLP on entire paragraphs rather than single sentences. Furthermore, the NER models could not be generalized to their full potential due to the very strict and specific format of scientific articles.
In the future, we plan to expand SENTINEL into an all-in-one NER database for molecule-molecule interactions, with direct references to the corresponding scientific articles, usable by every iGEM-er. To achieve this, we plan to harness vast datasets sourced from across the internet as a foundation for more advanced NLP models, such as BERT and SciBERT. Additionally, we aim to enhance SENTINEL's functionality by integrating it with established databases like ChEMBL and PubChem. This integration would provide valuable bioreactivity information for molecules in the database, significantly enhancing the platform's depth and utility.